Links¶

Logic Gates

http://www.ee.surrey.ac.uk/Projects/CAL/digital-logic/gatesfunc/index.html#logicgates

Multi-class keras tutorial

https://machinelearningmastery.com/multi-class-classification-tutorial-keras-deep-learning-library/

Step-by-step backpropagation

https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Activation functions

https://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html

Basics

Dendrites pick up signals from neighboring neurons, the soma adds up all the signals, and the axon sends the signals to neighboring cells. We have billions of neurons in a human brain, connected in a complex structure, and this structure allows our brain to function. Inspired by this biological system, we can build a mathematical representation of a neuron and train a machine. The fundamental building block of a neural network is called the perceptron.

Logical gates - take two binary inputs and give one output

The neuron performs two tasks: it first computes the summation of the inputs, ΣXi, and then applies the step function, which checks the following conditions:

If ΣXi >= Θ, it outputs 1

If ΣXi < Θ, it outputs 0

This perceptron can model different logic gates. With three inputs, if Θ = 1 it works as an OR gate, where any one of the inputs being 1 activates the output, and if Θ = 3 it is equivalent to an AND gate, where all three inputs must equal 1. Some inputs can also be marked as inhibitory inputs: if any inhibitory input is active, it immediately prevents the output from activating.
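As a quick sketch, the three-input McCulloch-Pitts gate described above can be written in a few lines (the function name `mcculloch_pitts` is my own label for it):

```python
import numpy as np

def mcculloch_pitts(inputs, theta):
    # Fire (output 1) if the sum of the binary inputs meets the threshold theta.
    return 1 if np.sum(inputs) >= theta else 0

# With three binary inputs, theta = 1 behaves like OR and theta = 3 like AND.
print(mcculloch_pitts([0, 1, 0], theta=1))  # 1: at least one input is on
print(mcculloch_pitts([1, 1, 0], theta=3))  # 0: not all three inputs are on
print(mcculloch_pitts([1, 1, 1], theta=3))  # 1: all three inputs are on
```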

In [1]:
from IPython.display import Image, display

# Path to your image file
image_path = 'inputsoutputs.png'

# Display the image in the notebook
display(Image(filename=image_path))

This perceptron model was the first mathematical representation of a human neuron. The McCulloch-Pitts perceptron had some limitations, such as taking only binary inputs and using only a threshold step activation function, which was ineffective at learning anything about the data.

In 1958, Rosenblatt came up with another version of the perceptron that is being used today. This perceptron can take any number, not necessarily binary. All the inputs have some weights associated with them, and instead of taking the summation, it takes the weighted summation and applies a nonlinear function to generate the output. The perceptron is illustrated below.

In [2]:
# Path to your image file
image_path = 'perceptron.png'

# Display the image in the notebook
display(Image(filename=image_path))

The purpose of adding a bias term to the perceptron is to make theta part of the weights rather than part of the function. A bias input, which is always 1, is added to the perceptron, and its weight absorbs theta so that the threshold can be fixed at 0. With a bias term, the structure of the perceptron would look like this.
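A minimal sketch of this bias trick: a perceptron with an explicit threshold theta makes the same decisions as one with an always-1 input whose weight is -theta and a threshold of 0 (the function names here are illustrative):

```python
import numpy as np

def fires_with_theta(x, w, theta):
    # Original form: compare the weighted sum against an explicit threshold.
    return 1 if np.dot(w, x) >= theta else 0

def fires_with_bias(x, w, bias):
    # Bias form: an always-1 input with weight `bias` = -theta; threshold fixed at 0.
    return 1 if np.dot(w, x) + bias >= 0 else 0

w = np.array([0.5, 0.5, 0.5])
theta = 1.0
for x in [np.array([0, 0, 0]), np.array([1, 0, 1]), np.array([1, 1, 1])]:
    assert fires_with_theta(x, w, theta) == fires_with_bias(x, w, -theta)
print("same decisions with theta folded into the bias")
```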

Imagine Perceptron as a decision-making friend. We give Perceptron important things to think about, called "weights." We also give it a nudge button called "bias" that's always 1, so Perceptron starts thinking even when there's nothing to think about.

There's a secret rule called "theta" for making decisions. Instead of telling Perceptron separately, we mix it with the weights and set it as 0. This way, Perceptron still remembers the rule but it doesn't really change things.

With weights, bias, and a tiny bit of memory for the rule, Perceptron can make smarter decisions about good and bad stuff in a special way. That's why we add a bias term – to help Perceptron do its job better!

In [3]:
# Path to your image file
image_path = 'bias.png'

# Display the image in the notebook
display(Image(filename=image_path))

Consider the following statements about the McCulloch-Pitts Neuron Model: The inputs and outputs are binary.

The inputs of the McCulloch-Pitts neuron could be either 0 or 1. It has a threshold function as an activation function. So, the output signal (y-out) is 1 if the input (y-sum) is greater than or equal to a given threshold value, else 0. The number of inputs to a neuron can be many for McCulloch-Pitts, but the output should be only one.

Which of the below statements are true with respect to the Rosenblatt Perceptron:

 I. It can process non-boolean inputs.
 II. It can assign different weights to each input automatically.

A Rosenblatt perceptron works by taking in numerical inputs along with weights and a bias. It multiplies each input by its respective weight, adds these products together with the bias (this sum is known as the weighted sum), and then applies a function to the weighted sum.

Perceptrons are models inspired by the human brain, but they are not a mathematical representation of the human brain.

So you have y, which is the goal to predict

You have something like 60,000 rows of images to train the NN

Each image gets gridded, and each cell in the grid gets fed into the NN. The goal is an output where everything is 0 (or false) except for one output (which is true). The yhat is the output, and the goal is to find the gradients or derivatives with respect to each w, or cell input.

We want to see how yhat changes, and how fast it changes, with respect to each of my w's. Use the gradients or derivatives to adjust my WEIGHTS a little so that yhat becomes a little closer to y.

each w is a weight on a connection in the NN

This is a FEED FORWARD structure: left to right (sometimes bottom to top)

This is a fully connected layer

The most popular dataset used in neural networks is the MNIST (Modified National Institute of Standards and Technology) dataset. It contains 60,000 handwritten images of the digits 0 to 9 in the training dataset. Each image is 28*28 pixels, with pixel values scaled between 0 and 1, where 0 stands for black and 1 stands for white. A model can be trained on this data to recognize handwritten digits.

A 28*28 image gives 784 inputs, and the target variable has a value from 0 to 9, so the output layer will have 10 nodes, each node corresponding to a digit from 0 to 9. The number of neurons and hidden layers are the hyperparameters in Neural Networks that can be changed.
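To make the shapes concrete, here is a hedged sketch of a single forward pass with 784 inputs and 10 outputs, using random weights and a random stand-in for an MNIST image (the hidden size of 128 is just an example hyperparameter, not prescribed by the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random stand-in for one 28x28 MNIST image, flattened into 784 inputs.
image = rng.random((28, 28))
x = image.reshape(784)

# One hidden layer of 128 neurons (an example hyperparameter) and 10 output nodes.
W1, b1 = 0.01 * rng.standard_normal((128, 784)), np.zeros(128)
W2, b2 = 0.01 * rng.standard_normal((10, 128)), np.zeros(10)

hidden = np.maximum(0, W1 @ x + b1)              # ReLU hidden layer
logits = W2 @ hidden + b2
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over the 10 digits
print(probs.shape, round(float(probs.sum()), 6)) # (10,) 1.0
```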

In [4]:
# Path to your image file
image_path = 'NetworkTraining.png'

# Display the image in the notebook
display(Image(filename=image_path))

The output from the above model will give 10 numbers, and each number represents the probability of the number being 0, 1, 2 … 9, and it is called predicted y. The predicted values should be close to the actual values, so to achieve this derivatives/gradients of predicted y with respect to all the weights are calculated. From derivatives, how predicted y changes or in which direction it changes with respect to each of the weights is determined. So, derivatives/gradients are used to adjust the weights so that the predicted y becomes close to the actual y. Multiple perceptrons, when put together in a well-designed structure and more importantly trained with the relevant data in the right manner, can show signs of being intelligent.

The structure that is built above is known as Feed Forward, and it essentially feeds the input, and everything computed in a given layer is moved to the next layer, and so on. When every node in each layer is connected to every node in the next immediate layer, the structure is known as a Fully Connected Neural Network.

When training, you leverage the derivatives of yhat with respect to the weights

loss function - how do you define yhat being close to y or not. yhat is a vector. You can decide what loss function to use; how does it impact yhat?

backpropagation¶

gradient descent - the method for finding the right weights

overfitting - is possible in NN. They are robust against overfitting, but it is possible. There is a learning rate that only slightly tweaks the weights, so overfitting should really not be a huge concern.

non-convex optimization problem - there are many solutions that the NN can converge to, called 'local optima'. This is not always the best solution, but it is good, and is not overfit

CNN (convolutional neural network) - a NN with convolutional layers, typically used for images

Deep NN - a NN with many hidden layers (a NN on steroids)

  1. Xi : ith input
  2. Wij : represents the weight associated with the ith input and jth perceptron (output for a layer)
  3. bi : bias in the ith layer
  4. Zj = Σi WijXi + bj : output for a perceptron before activation. Applying an activation function gives F(Σi WijXi + bj)

The above elements associated with a perceptron can be represented in a matrix form.

The output of the perceptron is represented in matrix form as Z^1 = f(W^1 X + b^1), where W^[l], X, and b are matrices:

X : [x1, x2, x3, ..., xd] is the input vector. W^[i] : weight matrix, where the superscript [i] indicates the layer

  W^1 = [[w11, w21, w31, ...], [w12, w22, w32, ...], ...]

W^1 indicates the weight matrix for the first layer, where each element wij represents the weight associated with the ith input and the jth output perceptron. The dimension of the matrix is h1 × d, comprising h1 rows (number of output perceptrons) and d columns (number of inputs). b is the matrix that contains the biases. The output from the first hidden layer becomes the input for the second hidden layer, and so on until we reach the last layer. Similarly, Z^2 = f(W^2 f(W^1 X + b^1) + b^2) is the output for the second hidden layer, and so on until the last layer.
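Treating each layer as Z^l = f(W^l A^(l-1) + b^l), a forward pass over two layers can be sketched with NumPy; the dimensions here are invented for illustration:

```python
import numpy as np

def f(z):
    # Activation function; sigmoid is used here as an example.
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.random(4)   # input vector X with d = 4 features (sizes are illustrative)

# One (W, b) pair per layer; layer l computes Z^l = f(W^l A^(l-1) + b^l).
layers = [
    (rng.standard_normal((3, 4)), np.zeros(3)),  # layer 1: 4 inputs -> 3 neurons
    (rng.standard_normal((2, 3)), np.zeros(2)),  # layer 2: 3 inputs -> 2 neurons
]

a = x
for W, b in layers:
    a = f(W @ a + b)   # the output of each layer becomes the next layer's input
print(a.shape)         # (2,)
```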

  1. Y : Final output generated at the last layer.
In [3]:
from IPython.display import Image, display

# Path to your image file
image_path = 'nnarch.png'

# Display the image in the notebook
display(Image(filename=image_path))
In [5]:
import numpy as np

# Example matrices as NumPy arrays
matrix_a = np.array([[1, 3, 5], [2, 7,9],[6,8,4]])
matrix_b = np.array([1, 3, 5])

result_matrix = np.dot(matrix_a, matrix_b)
print(result_matrix)
[35 68 50]

Sigmoid(x) = 1 / (1 + e^(-x))

Sigmoid goes from 0 to 1

tanh(x) = 2*sigma(2x) - 1

tanh (hyperbolic tan) - is a compressed version of the sigmoid. It goes from -1 to 1

ReLU - rectified linear unit - not smooth (the most popular activation function). It outputs the input value x if x is greater than or equal to zero, and it outputs zero if x is negative. Visually, the ReLU function looks like a flat line at zero for all negative values of x and a straight line with a slope of 1 for all positive values of x.

Type of Activation Functions¶

1. Step Function: One of the most basic categories of activation functions is the Step Function. A threshold value is used in this case, and if the net input y value is greater than the threshold, the neuron is activated.¶

A binary step function is a threshold-based activation function, which means that it activates when a certain threshold is reached and deactivates when it falls below that point. Because there is a sharp jump in the function, the derivative at x = 0 is not well defined (the jump acts like an infinitely steep slope), which is not useful for learning, and the step function cannot be used for multi-class classification. Smooth functions with well-defined slopes are good for learning; the step function is a sharp function.

2. Sigmoid Activation: This activation function looks like a smoother version of a step function where the slopes at each point are well defined, and this smoothness is caused by using the exponential function.¶

The sigmoid function, also known as the logistic function, has an output that ranges from 0 to 1. It makes use of a probabilistic approach and is graphed in the shape of an "S." Because the values of the sigmoid function range between 0 and 1, the outcome can easily be predicted to be 1 if the value is greater than 0.5 and 0 otherwise. Sigmoid values are typically used in the output layer of a binary classification, with the result being either 0 or 1.

3. Tanh Activation: Tanh activation function is a scaled and compressed version of sigmoid activation function and is also called a hyperbolic tangent. The structure of this function is similar to the sigmoid activation function, but significantly superior because it allows for negative outputs and has an output range of -1 to 1.¶

4. ReLU Activation: Rectified linear activation function is the most commonly used activation function in the hidden layer of a neural network which ranges between 0 to inf.¶

if X>=0, X

if X<0, 0

If x is positive, it outputs x, and if not, it outputs 0. The ReLU activation function has a range of 0 to inf.

The advantage of ReLU is that it requires fewer mathematical operations than tanh and sigmoid, making it less computationally expensive.

The disadvantage of ReLu is that it produces dead neurons, which never activate, known as the dying ReLu problem.

ReLU(x) = max(0, x)
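The three activation functions above can be sketched directly in NumPy, including a check of the tanh identity tanh(x) = 2*sigma(2x) - 1 mentioned earlier:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))          # output in (0, 1)

def tanh_via_sigmoid(x):
    return 2 * sigmoid(2 * x) - 1        # the identity tanh(x) = 2*sigma(2x) - 1

def relu(x):
    return np.maximum(0, x)              # 0 for negative x, x otherwise

xs = np.array([-2.0, 0.0, 2.0])
print(sigmoid(xs))                                     # values strictly between 0 and 1
print(np.allclose(tanh_via_sigmoid(xs), np.tanh(xs)))  # True
print(relu(xs))                                        # [0. 0. 2.]
```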

5. Softmax: The softmax function is often described as a combination of multiple sigmoids. The sigmoid activation function returns values between 0 and 1, which are the probabilities of each of the data points belonging to a particular class. Thus, sigmoid is widely used for binary classification problems.¶

But the softmax activation function is used in the output layer of multi-class classification problems, where it returns the probability of a data point belonging to each individual class. Softmax takes logits ranging from -inf to inf and maps them to probabilities between 0 and 1. Below is the mathematical expression for softmax.

The function exponentiates the output for each class and computes the total of these values. The output for a particular class is its exponentiated value divided by that total, which normalizes each output to between 0 and 1.

Softmax is an activation function that scales numbers/logits into probabilities. The output of a softmax is a vector (say v) with the probabilities of each possible outcome. The probabilities in vector v sum to one over all possible outcomes or classes.

Mathematically, the softmax function is defined as follows:

Given an input vector of raw scores or logits, denoted as Z = [z_1, z_2, ..., z_n], where n is the number of classes, the softmax function computes the probability P(y=i) that a given input belongs to class i as follows: P(y=i) = exp(z_i) / (exp(z_1) + exp(z_2) + ... + exp(z_n))

In words, the softmax function exponentiates each element of the input vector (to make them positive) and then normalizes them by dividing by the sum of all the exponentiated values. This normalization ensures that the resulting probabilities sum to 1, making it a valid probability distribution.
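A small sketch of the softmax computation described above (subtracting the max before exponentiating is a common numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, round(float(p.sum()), 6))   # three probabilities in (0, 1) summing to 1
```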

The softmax activation function and the ReLU (Rectified Linear Unit) activation function serve different purposes in a neural network and are typically used in different parts of the network. They are not interchangeable because they perform distinct tasks.

6. Linear activation function: The linear activation function is one in which the activation is proportional to the input. This function does nothing to the weighted sum of the input and returns the value it was given.¶

The activation function of the last layer is merely a linear function of the input from the first layer, regardless of how many layers there are, assuming they are all linear. The linear activation function has a range of -inf to +inf. The neural network's last layer will operate as a linear function of the first layer.

Linear activation functions, often referred to as identity activation functions, are one type of activation function used in neural networks. These functions simply output the same value as their input without any transformation. In mathematical terms, the linear activation function is defined as: f(x) = x

7. Leaky ReLU: The Leaky ReLU function is an improved version of the ReLU activation function. The gradient of the ReLU activation function is 0 for all input values less than zero, which deactivates the neurons in that region and may cause the dying ReLU problem.¶

Leaky ReLU is defined to address this problem. Instead of defining the ReLU activation function as 0 for negative values of inputs (x), we define it as an extremely small linear component of x. Here is the formula for this activation function

f(x)=max(0.01*x , x)

If the above function receives any positive input, it returns x; otherwise, it returns a small value equal to 0.01 times x. As a result, it produces a non-zero output for negative values. With this small change, the gradient for negative inputs becomes non-zero, so the network no longer encounters dead neurons in that region.
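A one-line sketch of Leaky ReLU with the 0.01 slope from the formula above:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for positive inputs, a small 0.01*x slope for negative inputs
    return np.where(x >= 0, x, alpha * x)

xs = np.array([-10.0, -1.0, 0.0, 3.0])
print(leaky_relu(xs))   # negatives are scaled by 0.01 instead of being zeroed out
```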

ReLU has an issue of dying or dead neurons, so we can modify the activation function slightly to deal with this¶

Sigmoid and tanh output nodes are for classification problems

For regression we use linear output nodes (but linear activations won't work in hidden layers, since stacking linear layers collapses into a single linear function)

Softmax Activation Function:¶

Purpose: The softmax activation function is mainly used in the output layer of a neural network, especially in multi-class classification tasks. It transforms raw scores (logits) into a probability distribution over multiple classes. It ensures that the output of the network represents class probabilities, and the sum of these probabilities across all classes equals 1.

Output: The output of the softmax function is a set of class probabilities, one for each class. It's used to make predictions and measure the likelihood of the input belonging to each class.

Range: The output of softmax is a probability distribution, so values range between 0 and 1, and the sum of all class probabilities is 1.

ReLU Activation Function:¶

Purpose: The ReLU activation function is typically used in the hidden layers of a neural network, not in the output layer. Its primary purpose is to introduce non-linearity to the model. ReLU allows the network to learn complex, non-linear relationships within the data. It's particularly effective in addressing the vanishing gradient problem.

Output: The output of the ReLU function is the input value if it's positive, and it's zero if it's negative.

Range: ReLU outputs values in the range [0, ∞) for positive inputs, and it's zero for negative inputs.

TRAINING¶

Loss Function: Measures how close one vector is to another vector.¶

In the material, the teacher spoke about the case where our observation vector and our prediction vector are way off: how can we adjust the weights at each node so that the predicted vector is closer to our observed vector? We use the loss function to do that.

It is a way to measure the error much like MSE and SSE

A loss function, also known as a cost function or objective function, is a critical component in machine learning and deep learning algorithms. It plays a fundamental role in the training process of a model. The primary purpose of a loss function is to measure how well the predictions of a machine learning model match the actual target values (ground truth) for a given set of input data. It quantifies the error or "loss" between the predicted values and the true values.

Here's how a loss function works:

Input Data: In supervised machine learning, you have a dataset that consists of input data and corresponding target values. The input data represents the features or attributes of the data points, while the target values represent the desired or actual outcomes.¶

Model Predictions: During the training phase, the machine learning model takes the input data and makes predictions. These predictions are based on the model's current set of parameters (weights and biases).¶

Comparison with Ground Truth: The loss function then compares the model's predictions to the actual target values. It computes a single scalar value, the loss, that quantifies the dissimilarity between the predictions and the true values.¶

Optimization Objective: The goal during training is to minimize this loss function. In other words, you want to adjust the model's parameters (weights and biases) to make the loss as small as possible. Minimizing the loss effectively means that the model is getting closer to making accurate predictions.¶

Gradient Descent: Most often, gradient-based optimization techniques like gradient descent are used to minimize the loss function. These methods calculate the gradient (a vector of partial derivatives) of the loss with respect to the model's parameters. The gradient points in the direction of steepest increase in the loss. So, to minimize the loss, you adjust the parameters in the opposite direction of the gradient.¶

Iterative Process: The training process is iterative. You repeatedly feed batches of data into the model, calculate the loss, compute the gradients, and update the model's parameters. Over time, as the optimization process continues, the loss typically decreases, indicating that the model is improving its predictions.¶

The choice of the loss function depends on the specific task and the type of machine learning model being used. Different tasks, such as regression, classification, or generative modeling, require different loss functions. Common loss functions include:

  1. Mean Squared Error (MSE): Used in regression tasks to measure the squared difference between predicted and true values: MSE = (1/n) * sum((yi - yhati)^2). The ability to differentiate the function is very important for optimizing the loss function, and the problem with absolute values (the L1 loss, (1/n) * sum(|yi - yhati|)) is that its derivative is undefined at x = 0, considering the graph of the absolute-value function. Since it is difficult to differentiate the absolute function, the square of the error is used in its place, which is also called the L2 loss or mean squared error.
  2. Cross-Entropy Loss (Log Loss): Commonly used in classification tasks to measure the dissimilarity between predicted class probabilities and true class labels. Penalizes overconfident incorrect predictions.
  3. Hinge Loss: Used in support vector machines and some classification tasks, it penalizes misclassified samples.
  4. Kullback-Leibler Divergence (KL Divergence): Used in probabilistic models, such as variational autoencoders (VAEs), to measure the difference between predicted and true probability distributions.
  5. L2 Loss: For regression problems, we can use the L2 loss function, but for classification problems it does not turn out to be the right choice, as we observe binary outcomes in classification. For classification problems, we use another loss function called the Cross-Entropy Loss Function, defined as L = -[y*log(yhat) + (1 - y)*log(1 - yhat)]. Here y is the actual value, and yhat is the predicted value. For classification problems, the possible outcomes will be either 1 or 0, so the actual and predicted outcomes will either be 1 or 0. If the actual value (y) is 1, the second part of the equation will be zero. If the actual value is 0, then the first part of the equation will be zero.

Ex: Case 1: when y = 1 and yhat = 1, the loss will be zero

Case 2: when y = 0 and yhat = 0, the loss will be zero

Case 3: when y = 1 and yhat = 0, the loss will be high

Case 4: when y = 0 and yhat = 1, the loss will be high
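The four cases can be checked numerically with a small binary cross-entropy sketch (the `eps` clipping is only there to avoid log(0) at the extreme predictions):

```python
import numpy as np

def binary_cross_entropy(y, yhat, eps=1e-7):
    yhat = np.clip(yhat, eps, 1 - eps)   # clip to avoid log(0)
    return float(-(y * np.log(yhat) + (1 - y) * np.log(1 - yhat)))

# The four cases from the text: matching predictions give ~0 loss,
# mismatched predictions give a very high loss.
for y, yhat in [(1, 1.0), (0, 0.0), (1, 0.0), (0, 1.0)]:
    print(y, yhat, round(binary_cross_entropy(y, yhat), 4))
```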

problem of local optimum and saddle point¶

In neural networks and optimization problems in general, the terms "local optimum" and "saddle point" refer to two different challenges that optimization algorithms can encounter when trying to find the best set of model parameters (weights and biases). Let's explain each of these concepts:

Local Optimum:

Definition: A local optimum is a point in the parameter space where the loss function (or cost function) has a lower value than in its immediate neighboring points, but it may not be the globally lowest point. In other words, it's a point where the optimization algorithm gets stuck because it can't find a better solution by making small adjustments to the parameters.

Challenge: Local optima can be problematic because they prevent the optimization algorithm from reaching the best possible solution, which might be at a different point in the parameter space (the global optimum).

Solution: To address the issue of getting stuck in local optima, various strategies can be employed, such as using different optimization algorithms (e.g., stochastic gradient descent with various modifications), initializing the parameters differently, or employing techniques like simulated annealing or genetic algorithms.

Saddle Point:

Definition: A saddle point is a point in the parameter space where the gradient of the loss function is zero, but it's not an extremum. In other words, the loss function may have both increasing and decreasing directions around a saddle point. At a saddle point, the optimization algorithm can slow down significantly, as it may have difficulty distinguishing between a true extremum (minimum or maximum) and the saddle point.

Challenge: Saddle points can mislead optimization algorithms, making them converge slowly or become stuck for an extended period. The presence of saddle points can slow down the training of deep neural networks.

Solution: To deal with saddle points, more advanced optimization techniques have been developed, including algorithms that consider not only the gradient but also the curvature of the loss function (e.g., second-order methods like Newton's method or quasi-Newton methods). Additionally, using gradient noise or adding regularization terms can help algorithms escape saddle points more easily.

It's important to note that while local optima and saddle points can pose challenges in optimization, modern deep learning frameworks and algorithms have made significant progress in mitigating these issues. Techniques like mini-batch stochastic gradient descent, adaptive learning rate methods, and various regularization techniques have helped make optimization more robust and efficient, allowing neural networks to find good solutions even in complex loss landscapes. Additionally, the depth and non-linearity of neural networks can make them less susceptible to getting trapped in local optima compared to simpler models.

When a simple function such as f(x) = x^2 is differentiated, it gives df/dx = 2x. By differentiating, we find the slope of the function. Similarly, the loss function is differentiated with respect to all the weights.

However, unlike this simple example, there is no closed-form expression that describes a neural network's loss function.

Since minimizing the loss function without an explicit expression is difficult, the loss is instead minimized by iteratively adjusting the weights while training the model on the provided data.

For example: Consider the following data which has 4 data points with target variables assigned to it. When the given data is passed to a neural network model, it outputs the predicted values.

In [8]:
from IPython.display import Image, display

# Path to your image file
image_path = 'dp.png'

# Display the image in the notebook
display(Image(filename=image_path))

The function that is used to compute this error is known as Loss Function. The loss or error is calculated using the below loss function.

(1/n) * sum((yi - yhati)^2)

The error when data is passed to the network is plotted against the weight in the plot below.

In [9]:
from IPython.display import Image, display

# Path to your image file
image_path = 'gradient.png'

# Display the image in the notebook
display(Image(filename=image_path))

The learning rate is a tuning parameter that determines the step size at each iteration while moving toward a minimum of a loss function. If the learning rate is very small, it will take too long to reach the minimum, and if it's too large, the updates will oscillate and the minimum will be missed. The mechanism of updating the weights to minimize the loss function is known as the Gradient Descent technique.
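As a toy illustration of the update rule (weight = weight - learning_rate * gradient), here is gradient descent on a simple convex loss L(w) = (w - 3)^2, where the minimum is known to be at w = 3:

```python
# Toy example: minimize L(w) = (w - 3)^2, whose gradient is dL/dw = 2*(w - 3).
def grad(w):
    return 2 * (w - 3)

w, lr = 0.0, 0.1
for _ in range(100):
    w -= lr * grad(w)   # w_new = w_old - learning_rate * gradient
print(round(w, 4))      # converges to the minimum at w = 3
```

With a small learning rate the steps shrink as w approaches 3; a learning rate above 1.0 would make the updates overshoot and oscillate, as described above.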

While dealing with the non-convex function, as shown below, the gradient descent technique might get stuck at local minima and will not be able to reach global minima. But even finding the local minima allows the neural network to perform well.

Local Minima: Local minima refers to the point at which the value of the function is smaller than that at nearby points, however, it could be bigger than that at a distant point.

Global Minima: Global minima refers to the point where the function value is smaller than at all other feasible points.

Convex Function: A convex function has only one minimum, which is both the local and the global minimum.

In [10]:
from IPython.display import Image, display

# Path to your image file
image_path = 'global.png'

# Display the image in the notebook
display(Image(filename=image_path))

Since there are many parameters in a neural network to train, like weights and bias, the loss function ideally will not be convex, but it will be a non-convex function. Another disadvantage of Gradient Descent is that it converges to minima slowly.

If the learning rate is too high the neural network will NOT converge to the minima

Backpropagation¶

Backpropagation allows you to compute the gradients efficiently.

Forward propagation is how neural networks make predictions. Input data is “forward propagated” through the network layer by layer to the output layer, which makes a prediction.

In backpropagation, we propagate through the neural network backward, i.e., from the output layer to the input layer, and update the weights and biases of the neural network.

Let's understand this in detail with the help of an example.

Let’s consider a neural network with “N” inputs and a single neuron in the hidden layer

Backpropagation uses the chain rule to update the weights and biases. After each forward pass through the network, backpropagation performs a backward pass, adjusting the weights and biases of the network. Thus it helps in reducing the error of the loss function with respect to each weight of the network.

In [11]:
from IPython.display import Image, display

# Path to your image file
image_path = 'one.png'

# Display the image in the notebook
display(Image(filename=image_path))

Step 1: Forward Propagation¶

Step A: In forward propagation, the data points x_1, x_2, ..., x_N from the input layer are propagated to a single neuron, where each input is multiplied by its respective weight and then summed. Each neuron also has an additional term called the bias. The sum of the bias term and the linear combination of inputs and weights is the input to the single neuron, as shown in the below image:

In [12]:
from IPython.display import Image, display

# Path to your image file
image_path = 'two.png'

# Display the image in the notebook
display(Image(filename=image_path))

Step B:¶

In this step, we apply a nonlinear function to this linear combination. The functions we apply to these linear combinations are also known as Activation Functions. Activation Functions are supposed to introduce nonlinearity into our Neural Network. Simple linear functions in neural networks might not be helpful in learning complex patterns in data, hence we use non-linear activation functions to be able to learn complex patterns in our data.

In [13]:
from IPython.display import Image, display

# Path to your image file
image_path = 'three.png'

# Display the image in the notebook
display(Image(filename=image_path))

In the image above, we have applied a sigmoid function which is one of the activation functions.

What if we have multiple neurons and multiple layers? If we have multiple layers, each neuron receives the outputs of the previous layer as its inputs. For example, you can see in the below image that the neurons 'A11', 'A12', 'A13' and 'A14' receive 'x1', 'x2' and 'x3' as the inputs. And sequentially the neurons 'A21', 'A22', 'A23', and 'A24' receive the outputs of 'A11', 'A12', 'A13', and 'A14' as their inputs. Each neuron drawn in the below image is an encapsulated representation of the image above, i.e., each neuron is supposed to represent the linear equation and the activation function clubbed together.

In [14]:
from IPython.display import Image, display

# Path to your image file
image_path = 'four.png'

# Display the image in the notebook
display(Image(filename=image_path))

Step 2: Calculate the Loss Function¶

We use a loss function to determine the loss/error (the gap between the actual and predicted values) after we receive an output from the feedforward pass. The objective is to minimize the loss, and this is achieved by adjusting the weights and biases. Different loss functions suit different problems: for example, we use Mean Squared Error for regression and cross-entropy for classification problems.
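As a sketch (not code from this notebook), the two loss functions mentioned can be written as:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error, typically used for regression
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Cross-entropy for binary classification; eps guards against log(0)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

A perfect prediction gives zero (or near-zero) loss in both cases; the worse the prediction, the larger the value.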

Step 3: Backpropagation¶

We try to reflect the loss/error onto the weights of our Neural Network. The way to do it is: we take the derivative of the loss with respect to a particular weight and then shift the value of the weight in the opposite direction of that gradient, i.e., w ← w − α·(∂C/∂w), where C is the loss/error term, w is the weight we want to modify, and α is the learning rate.

The algorithms used to update the weights and biases are known as Optimizers. A few well-known optimizers are Gradient Descent, SGD, Adagrad, RMSprop and Adam, which will be discussed later.
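The weight update these optimizers build on can be illustrated with plain gradient descent on a toy one-dimensional loss (the loss function, learning rate, and iteration count here are made up for illustration):

```python
def loss(w):
    # toy convex loss with its minimum at w = 3
    return (w - 3.0) ** 2

def grad(w):
    # derivative of the toy loss with respect to w
    return 2.0 * (w - 3.0)

w, alpha = 0.0, 0.1
for _ in range(100):
    w = w - alpha * grad(w)   # shift w against the gradient of the loss
```

Each step moves w in the direction that decreases the loss; after enough iterations, w settles near the minimum at 3.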

Step 4: Repeat Forward Propagation and Backward Propagation until the cost function is minimized.¶

We repeat Forward Propagation and Backward Propagation until the loss/cost function is minimized.

The below graphic representation shows a single iteration of forward and backward propagation. In forward propagation, first, calculate the value for each node using the input layer and the activation functions. Secondly, make the predictions using the output layer and calculate the error/loss function using the predicted and the actual labels. In backward propagation, the weights and biases are updated using derivatives to optimize the loss function.

In [15]:
from IPython.display import Image, display

# Path to your image file
image_path = 'five.png'

# Display the image in the notebook
display(Image(filename=image_path))

Optimizers¶

Optimizers are algorithms or methods used to change the parameters of the neural network, such as weights and learning rate to reduce loss. Some of the optimizers are Gradient Descent, Stochastic Gradient Descent, etc.

Apart from the optimizers mentioned previously, there are other optimizers that can assist us in optimizing the loss function. The following optimizers will be discussed in the next video.

RMSprop: RMSprop is a gradient-based optimization method used in neural network training. Instead of treating the learning rate as a fixed hyperparameter, RMSprop adapts the effective learning rate per parameter, so the step size changes over the course of training.¶

Adam: Adam optimizer is a combination of two gradient descent methodologies, i.e. RMSprop and momentum. Adam is one of the popular optimizer algorithms that is widely used in training a neural network.¶
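As a rough sketch of the update rules (standard textbook forms, not code from this course), single-parameter RMSprop and Adam steps look like:

```python
import numpy as np

def rmsprop_step(w, g, s, lr=0.01, rho=0.9, eps=1e-8):
    # keep a moving average of squared gradients; scale the step by its root
    s = rho * s + (1.0 - rho) * g ** 2
    w = w - lr * g / (np.sqrt(s) + eps)
    return w, s

def adam_step(w, g, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # momentum-style average of gradients (m) plus RMSprop-style scaling (v)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** t)   # bias correction for early steps
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Both scale the raw gradient by a running estimate of its magnitude, so parameters with consistently large gradients take smaller effective steps.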

There are various other optimizers that will be discussed in detail in the next week.

When to use each classification function¶

Tanh (Hyperbolic Tangent):

Usage: Tanh is often used in the hidden layers of neural networks for classification problems.
Characteristics: Tanh squashes input values into the range [-1, 1], which makes it suitable for problems where the data is centered around zero. It is zero-centered, which can help with optimization in some cases compared to the sigmoid function.
Advantages: It can handle negative and positive input values and is useful for problems where the output can be positive or negative, not just binary.
Example: It can be used for sentiment analysis, where the sentiment can be either positive or negative.

Softmax:

Usage: Softmax is commonly used in the output layer for multi-class classification problems (where you have more than two classes).
Characteristics: Softmax transforms the raw scores or logits into a probability distribution over multiple classes. It is useful when you need to assign probabilities to multiple classes and ensure that these probabilities sum to 1.
Advantages: It is well-suited for problems like image classification, where an input image can belong to one of several possible categories.

ReLU (Rectified Linear Unit):

Usage: ReLU is commonly used in hidden layers for various types of classification problems.
Characteristics: ReLU activation is computationally efficient and helps with the vanishing gradient problem. It replaces all negative values with zero while leaving positive values unchanged.
Advantages: It is widely used due to its simplicity and ability to handle non-linearity in data. It is a good default choice for many problems.
Example: It can be used for image recognition, where the presence or absence of certain features in an image needs to be detected.

Leaky ReLU:

Usage: Leaky ReLU is a variant of ReLU and is also used in hidden layers for classification problems.
Characteristics: Leaky ReLU allows a small, non-zero gradient for negative inputs, addressing the "dying ReLU" problem where ReLU units can become inactive during training.
Advantages: It helps mitigate the vanishing gradient problem seen with ReLU. Leaky ReLU is a good choice when you suspect that a substantial portion of your neurons might be inactive during training.
Example: It can be used for speech recognition tasks, where input data may contain both positive and negative values.
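These four functions can be sketched with NumPy to see their characteristic ranges (the input values below are illustrative only):

```python
import numpy as np

def tanh(z):
    # zero-centered squashing into [-1, 1]
    return np.tanh(z)

def relu(z):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # like ReLU, but keeps a small slope alpha for negative inputs
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    # exponentiate and normalize so the outputs sum to 1
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
probs = softmax(z)
```

Note how ReLU zeros out negative inputs entirely, Leaky ReLU keeps a small negative slope, tanh saturates at -1 and 1, and softmax always yields a valid probability distribution.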

Review¶

Basics of Neural Networks¶

What is a perceptron? The Perceptron is a linear machine-learning algorithm for binary classification tasks. It may be considered one of the first and one of the simplest types of artificial neural networks. It is definitely not “deep” learning but is an important building block. It consists of a single node or neuron that takes a row of data as input and predicts a class label.

perceptron.jpg (figure: inputs xj, weighted by wj, feed into the neuron, which produces the output)

The figure above represents how the perceptron learning algorithm functions. In this example, the perceptron has three inputs x1, x2, and x3, and one output.

The importance of each input variable is determined by the respective weights w1, w2, and w3 assigned to these inputs. The output is either 0 or 1, depending on the weighted sum of the inputs:

weighted sum = w1·x1 + w2·x2 + w3·x3

If the weighted sum is below the threshold, the output is 0; otherwise it is 1. This threshold can be any real number and is a parameter of the neuron:

output = { 0 if sum(wj·xj) < threshold, 1 if sum(wj·xj) ≥ threshold }
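This threshold rule is enough to implement the logic gates described earlier. The sketch below assumes three binary inputs, following the McCulloch-Pitts setup (θ = 1 behaves like OR, θ = 3 like AND):

```python
def mp_neuron(inputs, theta):
    # McCulloch-Pitts unit: output 1 if the input sum reaches the threshold
    return 1 if sum(inputs) >= theta else 0

cases = [(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1)]
or_outputs  = [mp_neuron(x, theta=1) for x in cases]  # fires if any input is 1
and_outputs = [mp_neuron(x, theta=3) for x in cases]  # fires only if all three are 1
```

Running the same unit with two different thresholds reproduces the OR and AND truth tables.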

Advantages of neural networks over traditional ML algorithms: The architecture of Artificial Neural Networks (ANNs) involves hidden layers, and neurons can be added to those layers, so ANNs have the ability to learn and model nonlinear and complex relationships within the data. In many real-world problems involving unstructured and structured data, the relationship between the dependent and independent variables is non-linear and complex in nature. It has also been observed that, as the amount of data increases, the accuracy of neural networks increases compared to traditional ML algorithms, as shown in the graph. As the number of layers in a neural network increases, it becomes a deep neural network (deep learning).

Nodes and layers in a neural network

In [16]:
from IPython.display import Image, display

# Path to your image file
image_path = 'a.png'

# Display the image in the notebook
display(Image(filename=image_path))

The most common structure of a neural network is given as above. This network consists of input, hidden, and output layers. Each layer has nodes that are represented as circles. The lines between the nodes indicate the flow of information from one node to the next.

In this network, the flow of information is from input to output i.e., only in one direction. The nodes of the input layer are passive which means they do not modify the data. They receive a single value as an input and send the same value as outputs to various nodes. In comparison, the hidden and output nodes are active, i.e., they modify the data.

For example, the variables X11, X12, X13, … X115 are the data to be evaluated and represent the pixel values from an image or samples from an audio signal. Each value from the input layer is duplicated and sent to all the hidden nodes. The values entering the hidden node are multiplied by weights. The weighted inputs are then added to produce a single number. This is shown by the summation symbol in the diagram below. Before leaving this node, this number is passed through a nonlinear mathematical function called the sigmoid. This function limits the output value between 0 and 1. The number of hidden layers can be increased which makes the network a deep neural network.

In [17]:
from IPython.display import Image, display

# Path to your image file
image_path = 'b.png'

# Display the image in the notebook
display(Image(filename=image_path))

Why the input layer is considered passive¶

In neural networks, the input layer is often considered passive because it doesn't perform any computation or transformation of the input data. Instead, it serves as a conduit for passing the raw input features to the subsequent layers of the network. Here's why the input layer is typically referred to as passive:

Data Pass-through: The primary purpose of the input layer is to transmit the input data to the hidden layers and, eventually, to the output layer. It doesn't apply any weights, biases, or activation functions to the input features. The input features simply pass through the input layer to the first hidden layer.

No Learnable Parameters: Neural network layers consist of neurons (or units) that have learnable parameters, such as weights and biases, which are adjusted during training to make the network learn patterns in the data. The input layer doesn't have these parameters. It's just a placeholder for the raw data.

No Transformation of Values: While hidden layers apply non-linear transformations to the data through weights and activation functions, the input layer applies no transformation of its own. Each input node simply forwards its feature value unchanged; the weighted sums of the input features are computed by the first hidden layer, whose weights are typically initialized randomly and adjusted during training.

No Activation Function: Hidden layers usually apply activation functions like ReLU (Rectified Linear Unit), sigmoid, or tanh to introduce non-linearity into the network. These activation functions enable the network to learn complex relationships in the data. In contrast, the input layer doesn't apply any activation function.

No Computation: During the forward pass of training or inference, the input layer doesn't contribute to the computation of gradients or errors, which are crucial for backpropagation (the process used to update network weights during training). It simply passes the input data to the next layer.

Understanding Gradient Descent¶

How does a neural network work?

We learned that every machine learning algorithm finds parameters that try to minimize the loss/error function. The same is the case with neural networks as well. A neural network tries to minimize the loss function using an optimization method called gradient descent.

Let’s understand gradient descent with the help of regression.

We know that the error function for regression is given by:

Error = (Ŷ − Y)²¶

Where Ŷ = Wx + b¶

This error function is a convex function.

Gradient descent has 2 aspects:

First, forward propagation: we start with some random parameters W and b, propagate forward using these weights, and calculate the error term.

After calculating the error term we update the parameters using the equations:¶

W = W − α · (∂Error/∂W)

b = b − α · (∂Error/∂b), where α is the learning rate (step size)

In [18]:
from IPython.display import Image, display

# Path to your image file
image_path = 'c.png'

# Display the image in the notebook
display(Image(filename=image_path))

Again we calculate the error using the new parameters and then again update them. We repeat this process until we achieve convergence.

The same concept is used in deep neural networks. The only difference here is that the forward and backward propagation processes become lengthy because we have multiple hidden layers.

Let’s look at what happens at the hidden layer when applying gradient descent.

From the above discussion, we know that a regression problem takes inputs and assigns some weight to each input. These weights propagate forward and calculate the error term. Then they backpropagate and the weights get updated using the gradient descent algorithm.

There's one small point of difference with the hidden layer, however: the presence of the activation function.

For the 1st hidden layer, the linear combination of weights and the values from the input layer are passed through this activation function. The results generated after that are the values stored in the hidden layer.

For the 2nd hidden layer, the 1st hidden layer works as an input layer and the same process is repeated until the last layer (output layer). Then, the network calculates the error term and backpropagates to update the weights using gradient descent.

This process keeps on repeating until we reach the convergence point that shows minimum error.

This is how gradient descent is applied to deep neural networks.

In [19]:
from IPython.display import display, Math, Latex
display(Math(r'w \leftarrow w - \alpha \frac{\partial Loss}{\partial w}'))
$\displaystyle w \leftarrow w - \alpha \frac{\partial Loss}{\partial w}$

ChatGPT's explanation of Gradient Descent¶

Gradient Descent for Linear Regression:¶

Linear Regression aims to find the best-fitting linear model for a dataset. Here's the Gradient Descent algorithm for Linear Regression:

Loss Function:

The Mean Squared Error (MSE) loss function for Linear Regression is given by:¶

In [20]:
from IPython.display import display, Math
display(Math(r'\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i - (wX_i + b))^2'))
$\displaystyle \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i - (wX_i + b))^2$

Where:

N is the number of data points.
Yi is the actual output for the i-th data point.
Xi is the input for the i-th data point.
w is the weight (slope).
b is the bias (intercept).

Gradient Calculation:

Compute the gradients of the loss with respect to the parameters w and b:

In [21]:
display(Math(r'\frac{\partial \text{MSE}}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} X_i(Y_i - (wX_i + b))'))
display(Math(r'\frac{\partial \text{MSE}}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} (Y_i - (wX_i + b))'))
$\displaystyle \frac{\partial \text{MSE}}{\partial w} = -\frac{2}{N} \sum_{i=1}^{N} X_i(Y_i - (wX_i + b))$
$\displaystyle \frac{\partial \text{MSE}}{\partial b} = -\frac{2}{N} \sum_{i=1}^{N} (Y_i - (wX_i + b))$

Parameter Updates:

Update w and b using the learning rate (α):

In [23]:
display(Math(r'w \leftarrow w - \alpha \frac{\partial \text{MSE}}{\partial w}'))
display(Math(r'b \leftarrow b - \alpha \frac{\partial \text{MSE}}{\partial b}'))
$\displaystyle w \leftarrow w - \alpha \frac{\partial \text{MSE}}{\partial w}$
$\displaystyle b \leftarrow b - \alpha \frac{\partial \text{MSE}}{\partial b}$

Iteration:

Repeat steps 2 and 3 for a fixed number of iterations (epochs) or until the loss converges.
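The four steps above can be put together in a short NumPy loop. The tiny synthetic dataset, learning rate, and iteration count below are illustrative choices, not values from the text:

```python
import numpy as np

# Synthetic data drawn from the line y = 2x + 1
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0

w, b, alpha, N = 0.0, 0.0, 0.05, len(X)
for _ in range(2000):
    Y_hat = w * X + b
    dw = -(2.0 / N) * np.sum(X * (Y - Y_hat))  # dMSE/dw
    db = -(2.0 / N) * np.sum(Y - Y_hat)        # dMSE/db
    w -= alpha * dw                            # parameter updates
    b -= alpha * db
```

After enough iterations, w and b approach the true slope 2 and intercept 1.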

Gradient Descent for Logistic Regression:¶

Logistic Regression is used for binary classification. Here's the Gradient Descent algorithm for Logistic Regression:¶

Loss Function:

The Logistic Loss (cross-entropy) loss function for Logistic Regression is given by:

In [24]:
display(Math(r'\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} [Y_i \log(\hat{Y}_i) + (1 - Y_i) \log(1 - \hat{Y}_i)]'))
$\displaystyle \text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} [Y_i \log(\hat{Y}_i) + (1 - Y_i) \log(1 - \hat{Y}_i)]$

Where:

N is the number of data points.
Yi is the actual label (0 or 1) for the i-th email.
Ŷi is the predicted probability that the i-th email is spam.

Gradient Calculation:

Compute the gradients of the loss with respect to the parameters w and b:

In [25]:
display(Math(r'\frac{\partial \text{Loss}}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)X_i'))
display(Math(r'\frac{\partial \text{Loss}}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)'))
$\displaystyle \frac{\partial \text{Loss}}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)X_i$
$\displaystyle \frac{\partial \text{Loss}}{\partial b} = \frac{1}{N} \sum_{i=1}^{N} (\hat{Y}_i - Y_i)$

Parameter Updates:

Update w and b using the learning rate (α):

In [26]:
display(Math(r'w \leftarrow w - \alpha \frac{\partial \text{Loss}}{\partial w}'))
display(Math(r'b \leftarrow b - \alpha \frac{\partial \text{Loss}}{\partial b}'))
$\displaystyle w \leftarrow w - \alpha \frac{\partial \text{Loss}}{\partial w}$
$\displaystyle b \leftarrow b - \alpha \frac{\partial \text{Loss}}{\partial b}$

Iteration:

Repeat steps 2 and 3 for a fixed number of iterations (epochs) or until the loss converges.
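A matching sketch for logistic regression, using a made-up, linearly separable 1-D dataset (learning rate and iteration count are again illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: negative inputs labeled 0, positive inputs labeled 1
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
Y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

w, b, alpha, N = 0.0, 0.0, 0.5, len(X)
for _ in range(5000):
    Y_hat = sigmoid(w * X + b)
    dw = (1.0 / N) * np.sum((Y_hat - Y) * X)  # dLoss/dw
    db = (1.0 / N) * np.sum(Y_hat - Y)        # dLoss/db
    w -= alpha * dw
    b -= alpha * db

predictions = (sigmoid(w * X + b) > 0.5).astype(int)
```

Note that the gradient expressions match those for linear regression in form, but Y_hat here passes through the sigmoid.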

Cost Function and Loss Function¶

Cost function and loss function are synonymous (some people also call it error function). A cost function is a measure of error between what your model predicts and what the actual value is.

The functions below measure how well a model with parameters w and b fits the training examples x^(i).

Cost Function for Linear Regression: In Linear Regression, the cost function is often defined as the Mean Squared Error (MSE):

In [30]:
from IPython.display import display, Math

display(Math(r'\text{ChatGPT Cost (MSE)} = J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)})^2'))
$\displaystyle \text{ChatGPT Cost (MSE)} = J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} (h_w(x^{(i)}) - y^{(i)})^2$

Where:

m is the number of training examples.

x^(i) is the input feature of the i-th training example.

y^(i) is the actual output (target) of the i-th training example.

h_w(x^(i)) is the output predicted by the linear regression model with parameters w and b.

Loss Function for Logistic Regression:¶

In Logistic Regression, the loss function (or log loss) is used to measure the error between predicted probabilities and actual labels:

In [28]:
display(Math(r'\text{Log Loss} = L(w, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]'))
$\displaystyle \text{Log Loss} = L(w, b) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})]$

Where:

m is the number of training examples.

y^(i) is the actual label (0 or 1) of the i-th example.

ŷ^(i) is the predicted probability that the i-th example belongs to class 1 (spam, in the case of email classification).

These equations represent the cost and loss functions for Linear Regression and Logistic Regression, respectively. They are used to quantify how well the model's predictions match the actual data and guide the optimization process during training.

Activation Functions¶

Activation functions are introduced to learn complex patterns in the data. The activation function decides what is to be fired to the next neuron. It takes input from previous layers and converts it to some form of input for the next layers.

The most important role of activation functions is to introduce non-linearity into a neural network. Without it, the network is just a linear model: it can only fit a straight line (or hyperplane), while data often contains more complex patterns that a linear algorithm cannot capture.

We can use different types of activation functions as given below:

The Sigmoid Function: It is one of the most widely used non-linear activation functions. Sigmoid transforms the values into a range between 0 and 1. It can be interpreted as the probability of a particular class. The mathematical expression for sigmoid:¶

f(z) = 1 / (1+e^-z)

The Tanh Function is very similar to the sigmoid function. The only difference is that it is symmetric around the origin. The range of values, in this case, is from -1 to 1. Thus the inputs to the next layers will not always be of the same sign. The tanh function is defined as:¶

tanh(z) = 2·sigmoid(2z) − 1 = 2 / (1 + e^(−2z)) − 1

The ReLU Function is another non-linear activation function that has gained popularity in deep learning. ReLU stands for Rectified Linear Unit. The main advantage of using the ReLU function over other activation functions is that it does not activate all the neurons at the same time. This means that the neurons will only be activated if the output of the linear transformation is greater than 0. The plot below will help you understand ReLU better:¶

f(z)=max(0,z)

The Softmax Function¶

The Softmax function returns the probability of each class. Here's the equation for the Softmax activation function:

softmax(z_i) = exp(z_i) / Σ_j exp(z_j)

Here, z represents the values (logits) from the neurons of the output layer. The exponential acts as the non-linear function, and the exponentiated values are then divided by their sum in order to normalize them into probabilities.

Basics of Artificial Neural Networks¶

What are Artificial Neural Networks (ANNs)?¶

We know that our brain has billions and billions of neurons connected together to process information received through our ears, eyes, and other sensory organs as inputs and gives a response as an output.

Similarly, artificial neural networks have layers of neurons connected together to process the inputs, learn the task to perform, and give an output.

An artificial neural network can be thought of as a combination of linear and nonlinear equations which we use to produce the output of our desire by training it on our dataset. We can expect our Neural Network to learn the underlying relations between the input / independent variables and the output / dependent variable.

We are going to study fully connected neural networks where each and every neuron in a layer is connected to each and every neuron in its next layer. The below image shows a simple fully connected neural network:

Input Layer - Takes Inputs¶

Hidden Layer - Responsible for processing information¶

Output Layer - Gives Output¶

How Does A Neural Network Get Trained?¶

In the training stage of the neural network, weights are assigned to each connection between neurons. These weights are learnable parameters that are updated to find the optimal values.

Here, we can see that wa1,wa2,wa3, and wa4 are weights assigned to the connections of the 1st node of the input layer, wb1,wb2,wb3,wb4 are weights assigned to the connections of the 2nd node of the input layer, and so on.

Training Of A Neural Network:¶

Training of a neural network contains 2 main steps:

Forward Propagation¶

Forward propagation is how neural networks make predictions. Input data is “forward propagated” through the network layer by layer to the output layer which makes a prediction.

Backpropagation¶

In backpropagation, we propagate through the neural network backward i.e., from the output layer to the input layer, and update the weights and biases of the neural network.

Let's understand this in detail with the help of an example.

Let’s consider a neural network with “N” inputs and a single neuron in the hidden layer

Step 1: Forward Propagation¶

Step A:¶

In forward propagation, the data points x1, x2, ..., xN from the input layer are propagated to a single neuron, where each input is multiplied by its respective weight and the products are summed. Each neuron also has an additional learnable parameter called the bias. The sum of the bias term and the linear combination of inputs and weights is the input to the single neuron, as shown in the below image:

Step B:¶

In this step, we apply a nonlinear function to this linear combination. The functions we apply to these linear combinations are also known as Activation Functions. Activation Functions are supposed to introduce nonlinearity into our Neural Network. Simple linear functions in neural networks might not be helpful in learning complex patterns in data, hence we use non-linear activation functions to be able to learn complex patterns in our data.

In [31]:
from IPython.display import Image, display

# Path to your image file
image_path = 'd.png'

# Display the image in the notebook
display(Image(filename=image_path))

In the image above, we have applied a sigmoid function which is one of the activation functions.

What if we have multiple neurons and multiple layers?¶

If we have multiple layers, each neuron receives the outputs of the previous layer as its inputs. For example, you can see in the below image that the neurons 'A11', 'A12', 'A13' and 'A14' receive 'x1', 'x2' and 'x3' as the inputs. And sequentially the neurons 'A21', 'A22', 'A23', and 'A24' receive the outputs of 'A11', 'A12', 'A13', and 'A14' as their inputs. Each neuron drawn in the below image is an encapsulated representation of the image above, i.e., each neuron is supposed to represent the linear equation and the activation function clubbed together.

In [32]:
from IPython.display import Image, display

# Path to your image file
image_path = 'e.png'

# Display the image in the notebook
display(Image(filename=image_path))

Step 2: Calculate the Loss Function¶

After getting the output as a result from forward propagation, we will calculate the loss using the loss function. The weights and biases are updated in such a way that the loss function is minimized. There can be different types of loss functions depending on the nature of the problem. For example, for regression, we usually use mean squared error and for classification we use cross-entropy.

Step 3: Backpropagation¶

We try to reflect the error or cost term onto the weights of our Neural Network. Thus the way to do it is, we take the derivative of the cost with respect to a particular weight and then we shift the value of the weights in that direction as has been covered in the Functions And Derivatives section of the Pre Reads.

w = w − α · (∂C/∂w)

Where C is the cost term, w is the weight we want to modify, and α is the learning rate.

The algorithms used to update the weights and biases are known as Optimizers.

A few well-known optimizers are Gradient Descent, SGD, Batch SGD, etc.

Step 4: Repeat Forward Propagation and Backward Propagation until the cost function is minimized.¶

We repeat Forward Propagation and Backward Propagation until the cost/objective function is minimized.

The below graphic representation shows a single iteration of forward and backward propagation. In forward propagation, first, calculate the value for each node using the input layer and the activation functions. Secondly, make the predictions using the output layer and calculate the error/loss function using the predicted and the actual labels. In backward propagation, the weights and biases are updated using derivatives to optimize the loss function.

Optimizing the LossFunction¶

Pre-read - Stochastic Gradient Descent vs Mini-batch Stochastic Gradient Descent¶

In neural networks, the primary aim is to optimize the loss function, and various techniques can be used to do so. Gradient descent is an optimization algorithm that helps find the parameters of a neural network. Plain gradient descent can be overwhelming, particularly because of the number of weights and the number of rows needed to train a neural network: using the whole dataset for every update makes convergence toward the minima very slow. Let's have a look at the variants of Gradient Descent which overcome this problem.

Below are the various variants of Gradient Descent:

Stochastic Gradient Descent
Mini Batch Stochastic Gradient Descent

Stochastic Gradient Descent¶

Stochastic Gradient Descent is an optimization algorithm used in Machine Learning. During training, a random data point is selected, instead of the whole dataset, to compute the derivative of the loss function in each iteration.

For example: if a dataset consists of 1,000 data points, Stochastic Gradient Descent considers only one data point at a time when calculating the derivative of the loss function. Below is the image that shows the path taken by SGD to converge to the global minimum.

In SGD, convergence to the global minimum happens very slowly and noisily, because only a single record is used in each iteration of forward and backward propagation.

Mini Batch Stochastic Gradient Descent¶

Mini Batch Gradient Descent is another variant of gradient descent. During the training process, a batch of K data points is used to calculate the derivative of the loss function.

For example: if a dataset consists of 1,000 data points and K = 100, Mini Batch Stochastic Gradient Descent considers 100 data points at a time; in each iteration it takes 100 data points and calculates the loss function. The image below depicts the path Mini Batch SGD takes to converge to the global minimum.

Compared to SGD, Mini Batch SGD converges to the global minimum faster. Because each update is computed batch-wise rather than on the full dataset, its path to the global minimum still contains some noise.
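A minimal sketch of how a dataset of 1,000 points can be split into batches of K = 100 for Mini Batch SGD (the shuffle-then-slice scheme shown here is one common choice, not the only one):

```python
import numpy as np

def minibatches(X, Y, k, rng):
    # shuffle the indices once, then yield consecutive slices of size k
    idx = rng.permutation(len(X))
    for start in range(0, len(X), k):
        batch = idx[start:start + k]
        yield X[batch], Y[batch]

rng = np.random.default_rng(0)
X = np.arange(1000, dtype=float)
Y = 2.0 * X          # toy targets for illustration
batches = list(minibatches(X, Y, k=100, rng=rng))
```

Each iteration of Mini Batch SGD would compute the loss gradient on one such (X_batch, Y_batch) pair instead of on the full dataset.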

Local Minima: Local minima refers to the point at which the value of the function is smaller than that at nearby points, however, it could be bigger than that at a distant point.¶

Global Minima: Global minima refers to the point where the function value is smaller than at all other feasible points.¶

Convex Function: A convex function has one minimum - a nice property, as an optimization algorithm won't get stuck in a local minimum that isn't a global minimum. For example:¶

Non-convex function: A non-convex function is wavy - has some 'valleys' (local minima) that are not as deep as the overall deepest 'valley' (global minimum). Optimization algorithms can be stuck at the local minimum, and it can be hard to tell when this happens. For example:¶

Learning rate: Learning rate is a parameter that is used to control the rate at which an algorithm learns the values of the parameters.¶

Challenges with Mini Batch SGD

While dealing with non-convex functions, there is a high chance of getting caught in a local minimum instead of the global minimum. We hope to find a good local minimum fast, and speed is exactly where the learning rate becomes an issue. Let's look at the weight update formula and understand these challenges in depth:

new weight = old weight − learning rate × (derivative of loss)

The learning rate is critical here: if it is too high, the updates can oscillate and miss the optimal parameters; if it is too low, convergence takes a long time and the optimizer may get stuck in even shallow local minima, as shown in the figure below.

In [34]:
from IPython.display import Image, display

# Path to your image file
image_path = 'f.png'

# Display the image in the notebook
display(Image(filename=image_path))

Summarizing SGD with Momentum¶

Stochastic Gradient Descent with Momentum¶

Stochastic Gradient Descent with Momentum is a Stochastic Gradient Descent variant that works similarly to mini-batch SGD but adds momentum to overcome the noise.

In Mini Batch SGD, the model parameters are updated after iterating through all the data points in the given batch (of size K), and the direction of each update has some variance, which leads to oscillations. These oscillations make convergence hard to reach and slow down the process of attaining it. Momentum is used to combat this.

Momentum helps to avoid paths that do not lead to convergence. Stochastic Gradient Descent with Momentum uses exponentially weighted averages of the gradients over previous iterations to stabilize convergence. In other words, a fraction of the parameter update is carried over from the previous gradient step and added to the current gradient step.

The image below clearly distinguishes between Mini Batch SGD with and without momentum.

Before adding momentum, the formula for the weight update is:¶

new weight = old weight − learning rate × (derivative of loss)

After adding momentum, the formula for the weight update becomes:¶

new weight = old weight − learning rate × (exponentially weighted average of the current gradient and previous gradients)
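One common way to write this update uses a running average v of the gradients; the sketch below applies it to a made-up toy loss (w − 3)², with illustrative lr and beta values:

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    # v accumulates an exponentially weighted average of past gradients
    v = beta * v + (1.0 - beta) * grad
    w = w - lr * v            # step along the smoothed gradient
    return w, v

# Minimize the toy loss (w - 3)^2 with momentum
w, v = 0.0, 0.0
for _ in range(500):
    g = 2.0 * (w - 3.0)       # gradient of the toy loss
    w, v = momentum_step(w, v, g)
```

Because v blends the current gradient with past ones, abrupt changes in gradient direction are damped, which smooths the oscillations described above.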

Adaptive learning rates¶

Adaptive learning rates, also known as adaptive optimization algorithms or adaptive gradient methods, are techniques used in neural networks to adjust the learning rate during training based on the characteristics of the optimization problem. The learning rate is a hyperparameter that determines the step size at which the model's parameters (weights and biases) are updated during the training process. Adaptive learning rates are designed to improve the convergence and training efficiency of neural networks.

Here are some common adaptive learning rate algorithms:

AdaGrad (Adaptive Gradient Algorithm): AdaGrad adapts the learning rate individually for each parameter in the network. It accumulates the squared gradients for each parameter over time and uses these accumulated values to scale the learning rates. Parameters that have large gradients will have smaller learning rates, while parameters with small gradients will have larger learning rates. This makes AdaGrad well-suited for sparse data problems because it automatically adapts to the varying importance of different features.¶

The learning rate is one of the important parameters in updating the weights during optimization. It should become smaller and smaller as the model approaches the minimum, and make larger jumps while it is still far away; in other words, the learning rate needs to be adaptive. The model also needs to make optimal progress in all directions to reach the minimum faster. A few techniques are available for adapting the learning rate.

There are many model parameters (weights) and layers to train in deep learning, and the goal is to find the best value for each weight. In all of the previous methods, the learning rate was a single constant shared by all parameters of the network. Adagrad instead sets the learning rate adaptively per parameter, hence the name adaptive gradient.

Let's take a look at how AdaGrad changes its learning rate based on a parameter.

Note: AdaGrad is not great for non-convex surfaces because its learning rate decays too quickly.

In the equation below, 's' represents the sum of the squares of the previous gradients for the given parameter. As is usual when scaling by a variance or standard deviation, an epsilon (a small value just above zero) is added to the denominator so that it can never become zero.

When the sum of the squared past gradients (s) is high, the learning rate is divided by a large value, so the effective learning rate becomes small. Similarly, when the sum of the squared past gradients is low, the learning rate is divided by a small value, and the effective learning rate stays high.

Adagrad makes the learning rate dimension-specific, and the learning rate also decreases as the model moves closer to the minima, which is an indication of adaptive learning.

Adagrad performs well for convex optimization problems, but for extremely complex non-convex surfaces, the Adaptive gradient does not work very well. In Adagrad, the learning rate goes to zero so fast that it does not even reach the local minima.

In each iteration, the squared gradients accumulate, so the sum of squared past gradients only grows. As that sum grows, so does the denominator, and dividing the learning rate by an ever-larger number makes it ever smaller. Once the learning rate reaches a very low value, convergence takes a very long time; learning effectively stalls, much like in the vanishing gradient problem.

In [35]:
from IPython.display import Image, display

# Path to your image file
image_path = 'g.png'

# Display the image in the notebook
display(Image(filename=image_path))
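A minimal sketch of the AdaGrad update on the same toy quadratic loss (gradient 2w); 's' accumulates squared gradients exactly as described above, and the lr and eps values are illustrative:

```python
def adagrad_step(w, s, grad, lr=0.5, eps=1e-8):
    # s accumulates the squares of all past gradients for this parameter
    s = s + grad ** 2
    # epsilon keeps the denominator away from zero
    w = w - lr * grad / (s ** 0.5 + eps)
    return w, s

w, s = 5.0, 0.0
effective_lrs = []
for _ in range(100):
    w, s = adagrad_step(w, s, grad=2 * w)
    effective_lrs.append(0.5 / (s ** 0.5 + 1e-8))

# The effective learning rate only ever shrinks as s grows --
# the weakness on non-convex surfaces noted above
print(effective_lrs[0], effective_lrs[-1])
```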

RMSprop (Root Mean Square Propagation): RMSprop is similar to AdaGrad, but it addresses one of its limitations. In AdaGrad, the learning rates keep decreasing throughout training, which can lead to very small learning rates in the later stages of training. RMSprop mitigates this issue by using a moving average of squared gradients, which gives it a "decay" property, ensuring that the learning rates don't decrease too rapidly.¶

RMSprop works better for non-convex surfaces: because it uses a moving average of squared gradients rather than their full sum, the effective learning rate decays more slowly than AdaGrad's.

In [1]:
from IPython.display import Image, display

# Path to your image file
image_path = 'h.png'

# Display the image in the notebook
display(Image(filename=image_path))
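The same toy setup (quadratic loss, gradient 2w) shows the difference from AdaGrad: 's' is now an exponentially weighted moving average, so it can shrink as well as grow. The lr and beta values are illustrative:

```python
def rmsprop_step(w, s, grad, lr=0.05, beta=0.9, eps=1e-8):
    # s is a *moving* average of squared gradients, so it does not
    # grow without bound the way AdaGrad's accumulator does
    s = beta * s + (1 - beta) * grad ** 2
    w = w - lr * grad / (s ** 0.5 + eps)
    return w, s

w, s = 5.0, 0.0
for _ in range(500):
    w, s = rmsprop_step(w, s, grad=2 * w)
print(w)  # close to 0
```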

Adam (Adaptive Moment Estimation): Adam combines adaptive learning rates with momentum and is the most widely used optimizer in deep learning.¶

Adam is a combination of both RMSprop and SGD with Momentum. Like RMSprop, it utilizes the squared gradients to scale the learning rate and, like SGD with momentum, uses the moving average of the gradient as an alternative to the gradient itself for fewer oscillations for convergence.

Uses (RMSprop) + (SGD with momentum)

In [2]:
from IPython.display import Image, display

# Path to your image file
image_path = 'i.png'

# Display the image in the notebook
display(Image(filename=image_path))
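A sketch combining the two pieces on the same toy loss: m is the momentum term (moving average of gradients) and v the RMSprop term (moving average of squared gradients), each bias-corrected as in the original Adam paper. The hyperparameter values are illustrative:

```python
def adam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # momentum term
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSprop term
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v

# Minimize the toy loss f(w) = w^2, whose gradient is 2 * w
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)
print(w)  # close to 0
```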

Adadelta: Adadelta is another adaptive learning rate method that aims to address the issue of diminishing learning rates in AdaGrad. It uses a moving window of past gradients to compute a running average of squared parameter updates. This helps prevent the learning rate from decreasing too aggressively.¶

Nadam: Nadam combines the ideas of Nesterov's Accelerated Gradient (NAG) and Adam. It incorporates the momentum term from NAG into the parameter update rule of Adam. Nadam is designed to provide the benefits of both momentum and adaptive learning rates.¶

These adaptive learning rate algorithms help overcome some of the challenges associated with choosing a fixed learning rate for training neural networks. They automatically adjust the learning rates for each parameter, which can lead to faster convergence and improved model performance. The choice of which algorithm to use often depends on the specific problem and dataset, and hyperparameter tuning is still necessary to achieve optimal results.

Summarizing Weight initialization and its Techniques¶

Weight initialization is the procedure of assigning initial (typically random) values to the weights of a neural network before training. The purpose of a weight initialization technique is to prevent the gradients from exploding or vanishing during forward and backward propagation; if either occurs, the network takes much longer to converge.

Different cases of weights assigned to a neural network:

Initializing all weights to 0: When all weights are 0, the derivative of the loss with respect to every weight is the same in every layer, so all weights receive identical values in the subsequent iteration. The hidden units stay symmetric, and the neural network learns nothing new about the data.¶

Initializing all weights to the same value, such as 0, means that during backpropagation the gradients for all weights in a layer are also the same. This results in symmetric updates to the weights, causing the network to learn very slowly or not at all: the gradients do not provide the information each weight needs to update properly, leading to a lack of diversity in weight updates and slow convergence.

Initializing all weights with a large number: When all weights are set to a large number, the derivative with respect to all weights is large during backpropagation, and the model takes a long time to converge.¶

Initializing all weights with a large number: This problem is related to the gradient explosion problem. When all weights are set to large values, the gradients during backpropagation are also large. Large gradients can lead to weight updates that are too large, causing the optimization process to diverge rather than converge. It may make training unstable and result in NaN (not-a-number) values in the weights or activations.

In both cases, the key issue is that weight initialization affects the scale of gradients during training. Proper weight initialization techniques, such as Xavier/Glorot initialization or He initialization, are designed to mitigate these problems by controlling the initial scale of weights and gradients. This helps ensure that gradients neither vanish to zero nor explode to infinity, which can lead to more stable and faster training of neural networks.

So, how to initialize the weights?¶

As previously discussed, initializing the weights with 0 or large numbers is not an appropriate method because it causes the network to take longer to converge to minima. However, there are two different weight initializing techniques that can be used for different activation functions to overcome the above problems.

Weight initialization techniques to fight exploding or vanishing gradients:

Activation function: ReLU, Leaky ReLU - Weight initialization technique: He initialization

Activation function: Sigmoid, tanh - Weight initialization technique: Xavier initialization

Xavier initialization, also known as Glorot initialization, is a technique used to initialize the weights of neural network layers in a way that helps facilitate efficient training. It was introduced by Xavier Glorot and Yoshua Bengio in their 2010 paper "Understanding the difficulty of training deep feedforward neural networks."¶

The main idea behind Xavier initialization is to set the initial weights of neurons in a layer in a manner that balances the scale of activations during forward and backward propagation, thereby preventing gradients from vanishing or exploding too quickly. This is particularly important for deep neural networks, where gradient scaling issues can impede convergence.

Here's how Xavier initialization works:

1. For Sigmoid and Hyperbolic Tangent (tanh) Activation Functions:¶

When using the sigmoid or tanh activation functions, Xavier initialization suggests drawing the weights from a uniform or normal distribution with mean 0 and variance:

variance = 2 / (number_of_input_units + number_of_output_units)

The weights are then sampled from a distribution with mean 0 and a standard deviation equal to the square root of this variance. For a layer with n_in input units and n_out output units, each weight is initialized as:

weight = random_number * sqrt(variance)

2. For ReLU (Rectified Linear Unit) Activation Functions:¶

ReLU activations are more sensitive to the choice of initialization, so the closely related He initialization is used instead of Xavier. The weights are drawn from a Gaussian (normal) distribution with mean 0 and variance:

variance = 2 / number_of_input_units

As in the sigmoid and tanh case, each weight is then sampled as:

weight = random_number * sqrt(variance)

The goal of Xavier initialization is to ensure that the variance of the activations remains roughly the same across different layers of the network. If the weights are too small, the signal may vanish as it propagates through the network, and if they are too large, it may explode, making training difficult. Xavier initialization helps strike a balance between these two extremes.

It's worth noting that while Xavier initialization can be a good default choice for weight initialization, the specific initialization method you choose may depend on the activation functions and architecture of your neural network. In some cases, you may need to fine-tune or adjust the initialization method to achieve the best results for your particular problem.
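Both schemes reduce to choosing the standard deviation of the sampling distribution; below is a minimal numpy sketch, with arbitrary example layer sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Glorot/Xavier: variance = 2 / (n_in + n_out); suited to sigmoid/tanh
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def he_init(n_in, n_out):
    # He: variance = 2 / n_in; the usual choice for ReLU layers
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(512, 256)
print(W.std())  # close to sqrt(2/512) = 0.0625
```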

Summarizing Regularization¶

Data contains information and noise. In the image below, the first panel is an example of underfitting, the second is a good fit, and the third uses many parameters and is more complex, indicating that the model will overfit.

In [4]:
from IPython.display import Image, display

# Path to your image file
image_path = 'j.png'

# Display the image in the notebook
display(Image(filename=image_path))

Underfit: This model has captured neither the information nor the noise.¶

Good fit: This model has captured only the information.¶

Overfit: This model has captured both information and noise¶

Overfitting is a situation where a model performs well on the train set but poorly on the test set. Underfitting is a situation where model performance is poor on both the train and test sets. The term "regularization" describes methods for calibrating machine learning / deep learning models by adjusting the loss function to avoid overfitting or underfitting.

The more complex a model is, the more parameters it must fit, and the model tends to overfit. Neural networks are prone to overfitting because of the larger number of parameters. Neural networks can model higher-order and complex functions, which makes them more prone to overfitting. The way to force a neural network not to overfit is to reduce the complexity of the network.

L1 and L2 regularization are commonly used techniques in machine learning for both regression and classification.

There will be many weights in a neural network, some of which drive the loss function high. The goal is to reduce or eliminate such weights by adding a penalty term to the loss function.

Loss = L(y, yhat) + lambda * penalty

In this case, lambda is a parameter that controls how much the penalty term influences the loss function.

There are two forms of penalty terms that can be used to reduce complexity.

L1 Regularization or Lasso: Takes the sum of absolute weight values.¶

Lasso is also known as the least absolute shrinkage and selection operator. Because it uses absolute weights, some weights that are close to zero will become exactly zero and be eliminated, implying that the connection between the two nodes will be disconnected, thus reducing complexity.

Lasso Loss = Original Loss + lambda * (|w1| + |w2| + ... + |wn|)

L2 Regularization or Ridge: Takes the sum of the square of the weights.¶

The Ridge method will not make the weights zero; instead, it will significantly reduce the value of the weights. The connections between nodes with less significant information for prediction will have less influence, reducing complexity.

Ridge regularization encourages all the weights to be small but typically doesn't drive them to exactly zero. It distributes the regularization penalty more evenly across all the weights, which makes it less prone to feature selection compared to Lasso.

Ridge regularization is often used when you want to prevent the model from overfitting without necessarily discarding any of the input features.

Ridge Loss = Original Loss + lambda * (w1^2 + w2^2 + ... + wn^2)
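Both penalties can be computed in a couple of lines; the weights, lambda, and base loss below are arbitrary illustrative values:

```python
import numpy as np

def regularized_loss(original_loss, weights, lam, kind="l2"):
    """Add an L1 (lasso) or L2 (ridge) penalty to a loss value.
    lam controls how much the penalty influences the loss."""
    w = np.asarray(weights, dtype=float)
    if kind == "l1":
        penalty = np.sum(np.abs(w))   # |w1| + |w2| + ... + |wn|
    else:
        penalty = np.sum(w ** 2)      # w1^2 + w2^2 + ... + wn^2
    return original_loss + lam * penalty

weights = [0.5, -1.5, 2.0]
print(round(regularized_loss(1.0, weights, lam=0.1, kind="l1"), 2))  # 1.0 + 0.1 * 4.0  = 1.4
print(round(regularized_loss(1.0, weights, lam=0.1, kind="l2"), 2))  # 1.0 + 0.1 * 6.5  = 1.65
```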

Data Augmentation¶

Data augmentation is another technique used in neural networks to reduce overfitting. It is similar to the over-sampling techniques used in machine learning. Data augmentation expands the amount of available data by adding slightly changed versions of already existing data or generating new synthetic data from existing data. When a deep learning model is being trained, it serves as a regularizer and helps reduce overfitting. Data augmentation is most commonly applied to image data.

Dropout¶

Dropout helps solve overfitting by making sure that no single neuron becomes too critical in the learning process. In overfitting, neurons might work together too closely, causing problems when the model faces new data.

With dropout, some neurons are randomly turned off during each training step. This stops them from excessively compensating for each other's mistakes, which is a major cause of overfitting. Dropout encourages the network to be more self-reliant, making it better at handling new data.

An analogy: instead of training the entire team every round, we can select some team members and train them, and in the next round select another set of members. This way, we are able to identify the potential of each team member.

Dropout is a regularization technique used in neural networks to prevent overfitting. Overfitting occurs when a neural network learns to perform exceptionally well on the training data but struggles to generalize to new, unseen data. Dropout helps mitigate this problem.

In dropout, during the training process, a random subset of neurons in a particular layer is "dropped out" or temporarily turned off with a probability p (a hyperparameter typically set between 0.2 and 0.5). This means that these neurons do not contribute to the forward or backward passes during that training iteration. Essentially, dropout "deactivates" a portion of the neurons in each layer temporarily.

Here's how dropout works in more detail:

Forward Pass: During the forward pass (when data is fed into the network), each neuron's output is multiplied by a Bernoulli-distributed random variable with a probability of (1 - p) and then divided by (1 - p). This scaling ensures that the expected value of the neuron's output remains the same, preventing the network from becoming overly reliant on any particular set of neurons.¶

Backward Pass: During backpropagation (when the network learns from its mistakes), only the active neurons (those that were not dropped out) contribute to the gradient calculation. This encourages the network to distribute the learning across different neurons, preventing co-adaptation of neurons, which is a common cause of overfitting.¶

The main benefits of dropout are:¶

Regularization: Dropout acts as a form of regularization, helping to reduce overfitting by preventing the network from relying too heavily on specific neurons.¶

Ensemble Effect: Dropout can be seen as training multiple neural networks with shared parameters, as each dropout configuration during training is effectively a different subnetwork. When making predictions, dropout is usually turned off, but the model effectively approximates an ensemble of these subnetworks, which tends to improve generalization.¶

Robustness: Dropout can make the network more robust to noise in the input data because it learns to be less sensitive to the presence or absence of any particular neuron.¶

It's important to note that dropout is typically used during training but not during inference or when making predictions. During inference, all neurons are used, but their weights are scaled by (1 - p) to ensure consistent behavior.

Dropout is just one of many regularization techniques available in neural networks, and its effectiveness can vary depending on the specific problem and architecture. However, it has been widely adopted and proven effective in improving the generalization performance of deep neural networks.

In [5]:
from IPython.display import Image, display

# Path to your image file
image_path = 'k.png'

# Display the image in the notebook
display(Image(filename=image_path))
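The forward-pass description above (drop with probability p, rescale by 1/(1 - p)) can be sketched as "inverted dropout"; p = 0.5 and the all-ones activations are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout_forward(activations, p=0.5, training=True):
    """Inverted dropout: drop each unit with probability p during training
    and rescale survivors by 1 / (1 - p), so the expected activation is
    unchanged. At inference time the layer is a no-op."""
    if not training:
        return activations
    mask = rng.random(activations.shape) >= p  # keep with probability 1 - p
    return activations * mask / (1.0 - p)

a = np.ones(10000)
out = dropout_forward(a, p=0.5)
print(out.mean())  # expected value stays near 1.0
```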

Summarizing Batch Normalization¶

Batch Normalization is a technique for improving the performance and stability of neural networks, and it also acts as a powerful regularizer against overfitting. Batch Normalization normalizes the inputs of each layer so that they have a mean output activation of zero and a standard deviation of one.

There are usually two stages in which Batch Normalization is applied:

Before the activation function (non-linearity), or after the activation function (non-linearity).

In [6]:
from IPython.display import Image, display

# Path to your image file
image_path = 'l.png'

# Display the image in the notebook
display(Image(filename=image_path))

Batch Normalization is typically used after activation functions.

How Batch Normalization works?¶

Consider the following example to see how this works. We have a deep neural network, as illustrated in the image below.

X1, X2, X3 - inputs. Nij - the j-th neuron in the i-th hidden layer (e.g., N11 is the first neuron of the first hidden layer). O - output.

  1. Initially, our inputs X1, X2, and X3 are in normalized form, as they come from the pre-processing stage. When the input passes through the first layer, it is transformed as the sigmoid function is applied; the same transformation takes place at the second layer, and so on through the last layer.
  2. Although the input X was normalized, the output is no longer on the same scale after passing through multiple layers of the network and their activation functions; this leads to an internal covariate shift in the data.
  3. The input to the N21 neuron comes from the outputs of N11, N12, N13, and N14. These outputs have different distributions, so the input distribution of N21 shifts as well; this is called internal covariate shift.
In [7]:
from IPython.display import Image, display

# Path to your image file
image_path = 'm.png'

# Display the image in the notebook
display(Image(filename=image_path))

What is covariate shift?¶

Covariate shift is a specific type of dataset shift often encountered in machine learning. It occurs when the distribution of input data shifts between training data and test data, which can even result in overfitting.

How to overcome this issue?¶

To overcome this internal covariate shift, Batch Normalization will be applied to neural networks. Batch Norm is just another network layer that gets inserted between a hidden layer and the next hidden layer, as shown in the below image. Its job is to take the outputs from the first hidden layer and normalize them before passing them on as the input of the next hidden layer.

In [8]:
from IPython.display import Image, display

# Path to your image file
image_path = 'n.png'

# Display the image in the notebook
display(Image(filename=image_path))

In the above image, batch normalization is applied to both hidden layers. Since batch normalization is applied to the previous hidden layer, the inputs to the N21 neuron will all have the same distribution, i.e., the mean will be zero and the standard deviation will be one, so the covariates are not shifted.

Since all of the neurons in the network receive inputs from the same distribution, there will be no covariate shifts in the data and, as a result, no overfitting.
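The normalization a Batch Norm layer performs can be sketched directly; gamma and beta are the layer's learnable scale and shift (here left at their initial values of 1 and 0), and the batch below is a made-up example with features on very different scales:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of layer outputs to zero mean / unit variance
    per feature, then apply the learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# A batch of 4 samples with 3 features on very different scales
x = np.array([[1.0, 100.0, 0.1],
              [2.0, 110.0, 0.2],
              [3.0, 120.0, 0.3],
              [4.0, 130.0, 0.4]])
out = batch_norm(x)
print(out.mean(axis=0))  # ~0 for every feature
print(out.std(axis=0))   # ~1 for every feature
```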

RNN¶

RNN stands for Recurrent Neural Network. It is a type of artificial neural network designed for processing sequences of data. RNNs are particularly well-suited for tasks where the order of the data elements matters, such as natural language processing, speech recognition, and time series analysis.

The key feature of RNNs is their ability to maintain a hidden state or memory that captures information about previous elements in the sequence. This hidden state is updated at each step in the sequence and influences the network's predictions or computations at the current step. This recurrent nature allows RNNs to model sequential dependencies effectively.

Here's a simplified explanation of how an RNN works:

Input Data: At each time step in a sequence, the RNN receives an input, which could be a feature vector, a word in a sentence, or a data point in a time series.

Hidden State: The RNN maintains a hidden state, which is essentially a representation of the information it has seen in the past. This hidden state is updated at each time step based on the current input and the previous hidden state.

Output: The RNN can produce an output at each time step or just at the final step, depending on the specific task. The output is typically influenced by both the current input and the hidden state.

Mathematically, the computation in an RNN can be written as

h_t = f(W_hh * h_(t-1) + W_hx * x_t)

where h_t represents the hidden state at time step t, x_t is the input at that step, f is an activation function such as tanh, and W_hh and W_hx are weight matrices that the network learns during training.

RNNs are powerful tools for sequence modeling but suffer from some limitations. One significant issue is the vanishing gradient problem, which can make it difficult for RNNs to capture long-range dependencies in sequences. This limitation has led to the development of more advanced RNN variants, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), which are designed to address this problem and are often preferred for practical applications.

In summary, an RNN is a type of neural network that excels at handling sequential data by maintaining a hidden state that captures information from previous steps in the sequence, making it suitable for tasks like natural language processing and time series analysis.

In [ ]:
# Example of the RNN recurrence, made runnable with numpy
# (sizes and random weights are illustrative)
import numpy as np

rng = np.random.default_rng(0)

hidden_size, input_size, sequence_length = 4, 3, 5
W_hh = rng.normal(0, 0.1, (hidden_size, hidden_size))  # hidden-to-hidden weights
W_hx = rng.normal(0, 0.1, (hidden_size, input_size))   # input-to-hidden weights
input_sequence = rng.normal(0, 1, (sequence_length, input_size))

# Initialize the hidden state
h_t = np.zeros(hidden_size)

# Loop through the sequence
for t in range(sequence_length):
    # Get the input at time step t
    x_t = input_sequence[t]

    # Update the hidden state using the current input and previous hidden state
    h_t = np.tanh(W_hh @ h_t + W_hx @ x_t)

# The final hidden state can be used for making predictions or further computations
final_hidden_state = h_t

CNN¶

CNN stands for Convolutional Neural Network, and it's a type of deep learning model primarily used for tasks involving image analysis, but it can also be applied to other grid-like data, such as audio spectrograms or even text data in some cases. CNNs are particularly effective at capturing spatial patterns in data.

Here's an explanation of CNNs along with a Python-like representation of a convolution operation:

In [ ]:
import numpy as np

# Define an example 2D input image (grayscale)
input_image = np.array([
    [1, 2, 1, 0],
    [0, 1, 2, 1],
    [1, 2, 2, 2],
    [2, 2, 1, 0]
])

# Define a 2D convolutional kernel (filter)
kernel = np.array([
    [1, 0],
    [0, -1]
])

# Perform the convolution operation
def convolution(input_image, kernel):
    input_height, input_width = input_image.shape
    kernel_height, kernel_width = kernel.shape
    output_height = input_height - kernel_height + 1
    output_width = input_width - kernel_width + 1

    # Initialize an empty output feature map
    output_feature_map = np.zeros((output_height, output_width))

    # Perform the convolution
    for i in range(output_height):
        for j in range(output_width):
            output_feature_map[i, j] = np.sum(
                input_image[i:i+kernel_height, j:j+kernel_width] * kernel
            )

    return output_feature_map

# Apply the convolution operation to the input image using the kernel
output_feature_map = convolution(input_image, kernel)

# Print the resulting feature map
print(output_feature_map)

In this Python-like code:

Here, input_image represents a 2D grayscale image and kernel is a 2D convolutional filter. The convolution function slides the kernel over the input image, element-wise multiplies the overlapping portions, and sums the results to produce an output feature map. This feature map captures certain patterns or features in the input data.

In practice, CNNs consist of multiple layers of convolutional operations, followed by activation functions and pooling layers to create hierarchical representations of the input data. These representations are then used for various tasks, such as image classification, object detection, and more. Libraries like TensorFlow and PyTorch provide efficient ways to build and train CNNs for real-world applications.
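The pooling layers mentioned above can be sketched the same way as the convolution; below is a 2x2 max-pooling pass over a made-up feature map:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """2x2 max pooling: keep the strongest activation in each window,
    halving the spatial resolution."""
    h, w = feature_map.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max()
    return out

fm = np.array([[1, 3, 2, 0],
               [4, 6, 1, 1],
               [0, 2, 5, 7],
               [1, 2, 3, 4]])
print(max_pool(fm))
# [[6. 2.]
#  [2. 7.]]
```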

FAQ - Building Blocks of Neural Networks

Q1. The various methods for creating Neural Network models are listed below. What is the difference between these codes? What effect do these codes have on the model?

model = tf.keras.Sequential()

model = keras.Sequential()

model = Sequential()

The only difference between the three codes above is in how the Sequential function is referenced; all three will work and produce the same model output.

model = tf.keras.Sequential()

In the above code, the Sequential function is called through the TensorFlow and Keras libraries.

model = keras.Sequential()

In the above code, the Sequential function is called using the Keras library.

from tensorflow.keras.models import Sequential

model = Sequential()

In the above code, the Sequential function is called using Tensorflow and Keras libraries after importing the libraries.

Q2. Is it always necessary to use batch normalization when building models with ANN? What is its application?

No, Batch Normalization is not always used in the model unless the model is overfitting. Batch Normalization is one of the techniques used in Neural Networks to prevent the model from overfitting.

Q3. How do we decide how many neurons to include in each of the hidden layers? Why 256, 64, 32, etc.?

The number of neurons is one of the hyperparameters passed to a Neural Network, so there is no rule of thumb for what number to use in the model.

As we know, computers work in binary, which is why memory addressing, resolutions (in games), and storage sizes are powers of two; choosing powers of two for the number of neurons follows the same convention and can map efficiently onto GPU hardware. However, there is no rule that only powers of two may be used as neuron counts; non-power-of-two numbers work as well.

Q4. What is the function of 'units' in the code below?

model2.add(Dense(activation = 'relu', input_dim = 11, units = 128))

Units in dense layers are the number of neurons present in the dense layer. You can specify the number of neurons positionally or define units = 128 explicitly, as shown in the examples below.

model2.add(Dense(activation = 'relu', input_dim = 11, units = 128))

model2.add(Dense(128, activation = 'relu', input_dim = 11))

Q5. Let’s say there is a dataset with 11 columns and 10000 rows. So the input_dim should be equal to the number of columns, and the units should be equal to the number of rows?

Yes, if there are 11 columns in the dataset, then input_dim will be 11, but the units in a layer are not tied to the number of rows in the dataset. The number of units in the input/hidden layers is a hyperparameter passed to the model; common choices are 16, 32, 64, 128, 512, 1024, etc.

Q6. Why is it necessary to apply to_categorical on the target column? Is this due to the multi-class classification problem?

Yes, for a multi-class classification problem, we must one-hot encode the target column with to_categorical to remove the ordinal weightage of the class numbers.

In the week-1 hands-on, we used a multi-class classification with 10 classes to predict between 0 and 9. When we use this target variable in its current form, the model interprets the highest number, 9, as having more weight than other numbers, and the model is biased toward the highest number. To remove the weightage of the number, we will encode the target variable with to_categorical.

Adjust the Model Architecture:¶

  1. Experiment with different architectures, such as adding more hidden layers, increasing the number of neurons in existing layers, or using different activation functions. A more complex model may capture intricate patterns better, but be cautious of overfitting.

Regularization Techniques:¶

  1. You've already added a dropout layer, which is a form of regularization. You can try adjusting the dropout rate (0.5 in your case) or consider other regularization techniques like L1 or L2 regularization.

Learning Rate and Optimizer:¶

  1. Experiment with different learning rates and optimizers (e.g., Adam, RMSprop, or SGD). The choice of optimizer and learning rate can impact convergence speed and final performance.

Batch Size:¶

  1. Adjust the batch size used during training. Smaller batch sizes might allow the model to generalize better, but training may take longer.

Epochs:¶

  1. Train for more epochs if the model has not yet converged. However, monitor for overfitting as you increase the number of epochs.

Feature Engineering:¶

  1. Carefully examine your input features (the 10 columns) and consider if there are additional features that could be engineered to provide more information to the model.

Data Preprocessing:¶

  1. Ensure your data is properly preprocessed, including scaling and normalizing the input features. It's essential to preprocess data consistently across training and validation sets.

Data Augmentation (if applicable):¶

  1. If you have limited data, you can use data augmentation techniques to create additional training examples.

Early Stopping:¶

  1. Implement early stopping to prevent overfitting. Monitor the validation loss, and if it starts increasing after a certain number of epochs, stop training.
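As a sketch of the early-stopping suggestion above, here is how the Keras EarlyStopping callback might be configured (the monitored quantity, patience value, and the commented-out fit() call are illustrative assumptions, not tuned settings):

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training when validation loss has not improved for 5 consecutive
# epochs, and roll the weights back to the best epoch seen.
early_stop = EarlyStopping(
    monitor='val_loss',          # quantity to watch
    patience=5,                  # epochs with no improvement before stopping
    restore_best_weights=True,   # revert to the best-performing weights
)

# Passed to fit() alongside a validation split, e.g.:
# model.fit(X_train, y_train, validation_split=0.2, epochs=100,
#           callbacks=[early_stop])
```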

ANN vs CNN¶

ANN's¶

ANNs are sensitive to position. If a cat is on the left of the image, an ANN might identify it, but if the cat is moved to the right, the network may fail to recognize it.

ANNs lose spatial information, because the image is flattened into a 1-D vector before it reaches the network.

ANNs have a difficult time ignoring features that are not related to the target we are trying to identify.

CNNs¶

CNNs resolve the issues that ANNs have.

CNNs have spatial and translational invariance, so no matter where the cat is in the image, a CNN can detect it.

CNNs use convolutional and pooling layers that slide over the entire image, extracting features that match patterns in the image.

Weight sharing - because the same filter is reused across the whole image, CNNs reduce the number of weights that must be learned, making them faster and cheaper to train.

Weight sharing also makes the filter search insensitive to the location of a feature in the image.

So with pooling, CNNs look at regions and not pixel by pixel.

1¶

Convolution - applying filter weights to the image

2¶

ReLU is frequently used as the activation function for CNNs¶

A benefit of using ReLU is that it zeroes out negative values (output = max(0, x)). We don't want negative numbers when looking at images; you can't have negative brightness.

3¶

Pooling layer - this is part of the feature extraction stage. The pooling layer helps remove unwanted features from the image and reduces the size of the feature maps, thus reducing the cost of the model. CNNs often use max pooling, which discards unnecessary pixel information, but in the process we can lose some information.

CNN order of operations¶

1 Image input¶

2 Feature extraction pt 1 - Convolutional layer (ex 3x3x3 matrix) + activation layer (Relu)¶

3 Feature Extraction pt2 - Pooling layer¶

4 Prediction pt1 - Flatten data into a vector¶

5 Prediction pt2 - fully connected layer¶

6 output¶
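The six steps above can be sketched as a minimal Keras Sequential model (the input shape, filter counts, and class count are illustrative assumptions, not tuned values):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),                # 1. image input
    layers.Conv2D(32, (3, 3), activation='relu'),     # 2. convolution + ReLU
    layers.MaxPooling2D(pool_size=(2, 2)),            # 3. pooling
    layers.Flatten(),                                 # 4. flatten into a vector
    layers.Dense(64, activation='relu'),              # 5. fully connected layer
    layers.Dense(10, activation='softmax'),           # 6. output: 10 class probabilities
])
model.summary()
```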

Pooling methods¶

Pooling is an essential operation in Convolutional Neural Networks (CNNs) that helps reduce the spatial dimensions of feature maps while retaining important information. Pooling is primarily used to achieve spatial invariance, reduce computational complexity, and control overfitting. There are two common types of pooling layers in CNNs: Max Pooling and Average Pooling.

Max Pooling:

Max pooling is the most widely used pooling method in CNNs. In each pooling operation, a window (typically 2x2 or 3x3) slides over the input feature map, and for each window the maximum value is retained while the rest are discarded. Max pooling captures the most important features in a local neighborhood, making it particularly effective for tasks like object recognition, and it introduces translation invariance because the dominant feature within a region is retained regardless of its exact location.

Example in TensorFlow/Keras: tf.keras.layers.MaxPooling2D(pool_size=(2, 2))

Average Pooling:

Average pooling, as the name suggests, computes the average value of the elements within the pooling window. It smooths the feature maps and reduces the spatial dimensions similarly to max pooling but retains less detailed information. Average pooling is sometimes used when a smoother representation of the data is desired.

Example in TensorFlow/Keras: tf.keras.layers.AveragePooling2D(pool_size=(2, 2))

Global Average Pooling (GAP):

GAP computes the average value of each feature map over its entire spatial dimensions, reducing each feature map to 1x1. GAP is often used as the final pooling layer before the fully connected layers in CNNs; it helps reduce the number of parameters in the network and improves generalization.

Example in TensorFlow/Keras: tf.keras.layers.GlobalAveragePooling2D()

The choice between max pooling, average pooling, or GAP depends on the specific task and the characteristics of the data. Max pooling is most common for tasks like image classification and object recognition, while average pooling or GAP may be used when a smoother representation or fewer parameters are desired.
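To make the difference concrete, here is a hand-computed 2x2 max and average pooling (stride 2) over a small NumPy feature map; the values are made up for illustration:

```python
import numpy as np

# A 4x4 feature map, pooled with a 2x2 window and stride 2.
fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 2],
                 [2, 2, 3, 4]], dtype=float)

# Rearrange into four non-overlapping 2x2 windows.
windows = fmap.reshape(2, 2, 2, 2).swapaxes(1, 2)

max_pooled = windows.max(axis=(2, 3))   # keep the largest value per window
avg_pooled = windows.mean(axis=(2, 3))  # average each window instead

print(max_pooled)  # [[4. 2.]
                   #  [2. 5.]]
print(avg_pooled)  # [[2.5  1.  ]
                   #  [1.25 3.5 ]]
```

Max pooling keeps only the dominant activation in each region, while average pooling blends all four values, giving the smoother but less detailed output described above.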

Overfitting¶

A solution to overfitting is regularization

If we are talking about images, some techniques are:

  1. Data augmentation:
    1. geometric transformations - images can be flipped, cropped, rotated
    2. mixing images - combining images: pixel averaging, crop overlaying
    3. color space transformations - change brightness, contrast, RGB channels
    4. random erasing - deleting parts of the image
    5. kernel filters - sharpen or blur images with filters

Batch normalization¶

When using images in a DNN the computation can get costly. Normalizing values to a range between 0 and 1 reduces the computational expense.

Normalization also helps prevent exploding and vanishing gradients: when values drift toward extremes (for example, dividing by a number close to 0 can produce values that look like infinity), the model can crash, and normalization helps prevent that.

Batch normalization speeds up the training of neural networks by reducing internal covariate shift.

Internal Covariate Shift¶

Internal covariate shift is the change in the distribution of each layer's inputs during training, caused by the parameters of the preceding layers being continually updated.

This shift makes training harder, because every layer must keep adapting to a moving input distribution, which slows convergence and forces smaller learning rates.

Batch normalization addresses this by normalizing each layer's inputs within every mini-batch, keeping their distributions stable as training progresses.

(Ordinary covariate shift, by contrast, happens between datasets: say the trained model only sees white cats, but the testing set contains black cats; the model will not know what the black cats are.)

Covariate Shift¶

Covariate shift is a problem because the statistical properties of the input data, such as the mean, variance, or shape of the distributions, differ between the training and test datasets. This shift in feature distributions can lead to challenges in model generalization, because a model trained on one distribution may not perform well on a different distribution, potentially resulting in decreased predictive accuracy.

Batch normalization normalizes each input variable per mini-batch, which means that during the weight update, the assumptions made by the subsequent layer regarding the spread and distribution of its inputs will not change drastically. When the distribution of inputs to each layer stays similar, training a network is more efficient and faster.

Batch normalization also provides a weak form of regularization: it adds noise to the data, which has a regularizing effect.
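The normalization step itself can be sketched in NumPy for a single feature (gamma and beta are the learnable scale and shift parameters, initialized here to 1 and 0; the batch values are illustrative):

```python
import numpy as np

# One feature's activations across a mini-batch of 4 examples.
batch = np.array([2.0, 4.0, 6.0, 8.0])
eps, gamma, beta = 1e-5, 1.0, 0.0  # eps avoids division by zero

mean = batch.mean()                            # mini-batch mean: 5.0
var = batch.var()                              # mini-batch variance: 5.0
normalized = (batch - mean) / np.sqrt(var + eps)
out = gamma * normalized + beta                # learnable rescale and shift

print(out.mean())  # ~0.0 -> zero mean after normalization
print(out.std())   # ~1.0 -> unit variance after normalization
```

After this step, every layer sees inputs with a stable mean and spread regardless of how the previous layers' weights have moved.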

filters¶

Individual filters of a CNN are NOT rotation invariant

Spatial Dropout¶

With dropout we can ignore or turn off certain nodes for one step of training, which makes the model less likely to overfit (more complex models are more prone to overfitting).

The basic dropout method might not help with overfitting on images, because neighboring pixels are so highly correlated that dropping individual pixels doesn't actually remove much information.

Spatial dropout - is where we drop out entire feature maps instead of individual activations.

Feature maps: In a convolutional layer, feature maps are produced by applying convolutional filters (also called kernels) to the input data. These filters slide or convolve across the input, capturing different patterns and features.

With spatial dropout, whole feature maps are dropped at random, rather than individual pixels.
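A quick sketch with Keras's SpatialDropout2D shows the whole-feature-map behavior (the tensor shape and dropout rate here are illustrative assumptions):

```python
import numpy as np
import tensorflow as tf

# A batch of one 4x4 "image" with 8 feature maps, all ones.
x = tf.ones((1, 4, 4, 8))
layer = tf.keras.layers.SpatialDropout2D(rate=0.5)

# training=True activates dropout even outside of fit().
y = layer(x, training=True).numpy()

# Each channel is either entirely zeroed out, or entirely kept and
# rescaled by 1/(1-rate) = 2 to preserve the expected activation.
per_channel = y.reshape(-1, 8)  # rows: pixels, columns: channels
for c in range(8):
    print(c, np.unique(per_channel[:, c]))  # each channel: [0.] or [2.]
```

Note how no channel is ever partially dropped; standard Dropout on the same tensor would instead zero scattered individual pixels.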

Ideal CNN model¶

  1. Low bias
  2. Low variance
  3. High performance (recall/precision/F1 score)

Requirements¶

  1. A large image dataset
  2. A diverse image dataset
  3. A high-quality (labelled) image dataset
  4. High computational power

Transfer learning¶

Transfer learning can be used when there is a limited amount of labeled data and computational power.

Transfer learning is a machine learning technique where a model trained on one task or dataset is adapted or reused as a starting point for a different but related task or dataset. Instead of training a model from scratch, transfer learning leverages knowledge learned from one domain to improve the performance of a model in another domain. Transfer learning is particularly useful when the target task has limited data or computational resources.

Here are key characteristics and advantages of transfer learning:

Pretrained Model: In transfer learning, a pretrained model is used as a base. This pretrained model has already learned useful features and representations from a large and diverse dataset, often in a different domain. Popular pretrained models include architectures like VGG, ResNet, Inception, and BERT, which have been pretrained on massive datasets like ImageNet or Wikipedia.¶

Fine-Tuning: After loading the pretrained model, one can fine-tune it on a smaller, domain-specific dataset for the target task. During fine-tuning, some or all of the model's layers are updated to adapt to the new task. Lower layers capture low-level features (e.g., edges in images), while higher layers capture more abstract and task-specific features. In practice this means importing a pre-trained model along with its weights and biases and continuing its training with a new dataset: instead of initializing the weights and biases with random distributions, we initialize them with the weights and biases of the pre-trained model¶

Feature Extraction: Another approach in transfer learning is feature extraction, where you use the pretrained model as a fixed feature extractor. You remove the top layers of the model and use the activations of the remaining layers as features for your target task. These features can then be fed into a new classifier or regression model.¶

Advantages:

Improved Performance: Transfer learning often leads to faster convergence and better performance compared to training a model from scratch, especially when the target task has limited data.

Reduced Data Requirements: It can be applied effectively even when you have a small dataset for the target task.

Generalization: The pretrained model captures general features and knowledge, which can be useful for a wide range of related tasks.

Saves Time and Resources: It saves time and computational resources compared to training a deep neural network from scratch.

Domains of Application: Transfer learning is widely used in computer vision, natural language processing, speech recognition, and other domains. For instance, a model pretrained on a large text corpus can be fine-tuned for sentiment analysis, and an image classification model can be reused for object detection.

Challenges: Despite its advantages, transfer learning requires careful selection of the pretrained model, appropriate fine-tuning strategies, and consideration of domain differences. Mismatches between the source and target domains can limit the effectiveness of transfer learning.

ImageDataGenerator¶

ImageDataGenerator is a utility in Keras, a popular deep learning framework, used primarily for data augmentation and real-time data preprocessing when working with image data. It is especially valuable when training convolutional neural networks (CNNs) for tasks such as image classification, object detection, and image segmentation. ImageDataGenerator allows you to generate batches of augmented data on-the-fly during model training, which can improve the generalization and robustness of your models.

Here are some key features and functions of the ImageDataGenerator:

Data Augmentation: ImageDataGenerator can apply a wide range of data augmentation techniques to your training images. These techniques include random rotations, shifts, flips, zooms, shear transformations, and brightness adjustments. Data augmentation helps increase the diversity of training data, which can lead to better model performance and reduced overfitting.

Normalization: It can perform real-time data preprocessing, such as feature scaling and mean-centering, on the input images. Normalization ensures that pixel values are within a specific range (e.g., [0, 1] or [-1, 1]), which can improve the convergence of the training process.

Batch Generation: ImageDataGenerator generates batches of images and their corresponding labels from a directory structure. It divides the data into mini-batches, which can be fed into the neural network during training. This mini-batch processing helps manage memory usage and accelerates training.

Flow from Directory: You can use the flow_from_directory method to load image data from a specified directory structure. The generator automatically organizes images into classes based on subdirectories and assigns class labels accordingly.

Custom Transformations: While ImageDataGenerator offers a range of built-in transformations, you can also define custom preprocessing functions and apply them to your image data.
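A minimal sketch of ImageDataGenerator with a few augmentation options, applied to a dummy batch of random "images" (all parameter values here are illustrative, not recommendations):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation and normalization configured together.
datagen = ImageDataGenerator(
    rescale=1.0 / 255,       # normalize pixel values into [0, 1]
    rotation_range=20,       # random rotations up to 20 degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # random left-right flips
)

# Four random 32x32 RGB "images" stand in for real data.
images = np.random.randint(0, 256, size=(4, 32, 32, 3)).astype('float32')
labels = np.array([0, 1, 0, 1])

# flow() yields augmented mini-batches on-the-fly during training.
batch_x, batch_y = next(datagen.flow(images, labels, batch_size=4))
print(batch_x.shape)  # (4, 32, 32, 3)
```

With a directory of labelled images you would call flow_from_directory instead of flow, and pass the generator straight to model.fit.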

How the CNN layers work in Transfer learning - very important¶

In [1]:
from IPython.display import Image, display

# Path to your image file
image_path = 'a.png'

# Display the image in the notebook
display(Image(filename=image_path))
  1. A transfer-learning model typically has more convolution steps than a CNN built from scratch

  2. Input is an image, output is a prediction

  3. The more convolutional layers you have, the more complex patterns they can find

The idea behind building convolutional layers in transfer learning for Convolutional Neural Networks (CNNs) is to leverage pretrained convolutional layers (often from a model trained on a large dataset) as feature extractors. These pretrained convolutional layers have learned to capture low-level and high-level features from images, such as edges, textures, and object parts. By reusing these learned features, you can save computation time and data requirements while adapting the model to a different but related task.

Here's the general process of building convolution layers in transfer learning:

Load Pretrained Model: Start by loading a pretrained CNN model, such as VGG, ResNet, Inception, or MobileNet, that has been trained on a large dataset like ImageNet.

Freeze Layers: Freeze the convolutional layers of the pretrained model. This means that you prevent these layers from being updated during training. This is because you want to retain the knowledge captured by these layers.

Add Custom Head: Add a custom set of layers on top of the frozen convolutional layers. These custom layers, often fully connected (dense) layers, will serve as the head of the network and will be responsible for learning task-specific features and making predictions.

Fine-Tuning (Optional): Depending on the specific problem and available data, you can choose to unfreeze some of the pretrained convolutional layers (typically from later layers) and fine-tune them along with the custom head layers. Fine-tuning allows the model to adapt to the nuances of the target task while retaining knowledge from the source task.

Here's an example in Python using TensorFlow and Keras to demonstrate building convolution layers in transfer learning:

In [2]:
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Load the pretrained VGG16 model without the top (fully connected) layers
base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

# Freeze the convolutional layers
for layer in base_model.layers:
    layer.trainable = False

# Add custom head layers for the specific task
x = Flatten()(base_model.output)
x = Dense(512, activation='relu')(x)
output = Dense(10, activation='softmax')(x)  # Example: 10 classes for image classification

# Create the final model by combining the base model and custom head
model = Model(inputs=base_model.input, outputs=output)

# Compile the model with an optimizer and loss function suitable for your task
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Print the model architecture
model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 224, 224, 3)]     0         
                                                                 
 block1_conv1 (Conv2D)       (None, 224, 224, 64)      1792      
                                                                 
 block1_conv2 (Conv2D)       (None, 224, 224, 64)      36928     
                                                                 
 block1_pool (MaxPooling2D)  (None, 112, 112, 64)      0         
                                                                 
 block2_conv1 (Conv2D)       (None, 112, 112, 128)     73856     
                                                                 
 block2_conv2 (Conv2D)       (None, 112, 112, 128)     147584    
                                                                 
 block2_pool (MaxPooling2D)  (None, 56, 56, 128)       0         
                                                                 
 block3_conv1 (Conv2D)       (None, 56, 56, 256)       295168    
                                                                 
 block3_conv2 (Conv2D)       (None, 56, 56, 256)       590080    
                                                                 
 block3_conv3 (Conv2D)       (None, 56, 56, 256)       590080    
                                                                 
 block3_pool (MaxPooling2D)  (None, 28, 28, 256)       0         
                                                                 
 block4_conv1 (Conv2D)       (None, 28, 28, 512)       1180160   
                                                                 
 block4_conv2 (Conv2D)       (None, 28, 28, 512)       2359808   
                                                                 
 block4_conv3 (Conv2D)       (None, 28, 28, 512)       2359808   
                                                                 
 block4_pool (MaxPooling2D)  (None, 14, 14, 512)       0         
                                                                 
 block5_conv1 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv2 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_conv3 (Conv2D)       (None, 14, 14, 512)       2359808   
                                                                 
 block5_pool (MaxPooling2D)  (None, 7, 7, 512)         0         
                                                                 
 flatten (Flatten)           (None, 25088)             0         
                                                                 
 dense (Dense)               (None, 512)               12845568  
                                                                 
 dense_1 (Dense)             (None, 10)                5130      
                                                                 
=================================================================
Total params: 27,565,386
Trainable params: 12,850,698
Non-trainable params: 14,714,688
_________________________________________________________________

In the VGG16 model and many other pre-trained deep learning models, the include_top parameter is used to specify whether or not the top (fully connected) layers of the model should be included when loading the model. Here's what it means:

include_top=True: When include_top is set to True, it includes the original top layers of the VGG16 model. These top layers consist of fully connected layers that were originally designed for image classification on the ImageNet dataset. If you set include_top=True, you essentially get the complete VGG16 model, which is capable of performing image classification tasks on 1,000 classes.

include_top=False: When include_top is set to False, it excludes the original top layers of the VGG16 model. This is often used for transfer learning and feature extraction. By excluding the top layers, you can use the model as a feature extractor or as a base for building your own custom top layers. This is useful when you want to adapt the pre-trained VGG16 model for a different task, such as fine-tuning it on a specific dataset or using it as a feature extractor for a different type of neural network.

In summary, setting include_top=False in the VGG16 model allows you to use the convolutional layers of the model for feature extraction and further customization, making it a versatile tool for a wide range of computer vision tasks beyond its original classification task.
